PaperHub
Score: 4.2 / 10 · Poster · 5 reviewers
Ratings: 4, 1, 2, 2, 3 (min 1, max 4, std 1.0)
ICML 2025

One Example Shown, Many Concepts Known! Counterexample-Driven Conceptual Reasoning in Mathematical LLMs

Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

The first work aiming to enable LLMs to perform mathematical reasoning by giving counterexamples.

Abstract

Keywords
Large Language Models · Mathematical Reasoning

Reviews and Discussion

Review
Rating: 4

The paper presents CounterMath, a novel benchmark to assess and enhance LLMs' ability to reason through counterexamples in mathematical proofs. Inspired by the pedagogical method of learning by counterexamples, the work introduces a dataset of university-level mathematical statements requiring counterexample-driven disproving. The authors also propose a data engineering pipeline to generate counterexample-based training data and demonstrate through empirical evaluation that LLMs, including OpenAI o1, struggle with counterexample-driven conceptual reasoning. Overall, this paper highlights the importance of counterexample reasoning in improving LLMs' overall mathematical capabilities and provides insights into potential future research directions in modern/research-level reasoning.

Questions for Authors

  1. Have you considered applying COUNTERMATH to a Lean-based proof assistant (or Isabelle, Coq, etc) to test whether counterexample-driven learning aids in formal symbolic reasoning?
  2. How does the model perform on longer proofs (like theorems in research-level math papers) that require lots of counterexamples? Is reasoning depth a bottleneck?

Claims and Evidence

This paper claims that LLMs rely heavily on exposure to existing proof patterns, which limits their generalization to novel mathematical statements. This is supported by the evaluation of LLMs on the CounterMath dataset, which emphasizes counterexample-driven proofs. Even SOTA models struggle with the benchmark, validating the authors' hypothesis. Additionally, they provide fine-grained analysis across different mathematical fields, demonstrating that real analysis and topology pose particular challenges. The evidence is well supported through rigorous experimentation, statistical evaluation, and comparisons with related mathematical benchmarks (e.g., PutnamBench, MATH, GSM8K).

Methods and Evaluation Criteria

Overall, the methodology for dataset curation is well-structured and clearly articulated. The problems are selected from university-level textbooks and validated through expert review, ensuring high-quality data. The evaluation criteria include:

  1. F1 lexical matching score, which measures correctness in determining the truth value of statements.
  2. Frequency that models generate counterexamples.
  3. Strict and Loose Alignment, assessing whether generated examples align with reference solutions.
  4. Testing performance on MATH and GSM8K to assess OOD generalization.

Theoretical Claims

This paper does not make any theoretical claim.

Experimental Design and Analysis

The experiments include baseline comparisons across various mathematical LLMs, including proprietary (GPT-4o) and open-source models (e.g., Deepseek-Math-7B-RL). Performance is broken down by mathematical field (e.g., Algebra, Functional Analysis, Real Analysis, Topology). Furthermore, this paper conducts fine-tuning experiments on Qwen2.5-Math-7B-Instruct to demonstrate the effectiveness of counterexample training.

On the other hand, while the results indicate that counterexample-based reasoning is crucial, the experiments lack an ablation study comparing different types of proof strategies. Including such an analysis could strengthen the claim that counterexample-driven reasoning is uniquely impactful. The claim that improving counterexample reasoning enhances general mathematical reasoning is supported by out-of-distribution (OOD) evaluations, but additional benchmarks from formal theorem proving could further validate this claim.

Supplementary Material

I reviewed the supplementary material, which contains the code for all evaluation/training scripts and the datasets mentioned in this paper.

Relation to Existing Literature

This paper is closely related to formal theorem proving and provides a promising approach towards automating modern mathematical problem solving.

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths:

  1. This paper introduces a novel benchmark that addresses a crucial gap in counter-examples in mathematical reasoning. The dataset is rigorously validated with valuable insights into mathematical LLMs' limitations.
  2. The training methodology shows that counterexample-driven learning is effective in improving reasoning.

Weakness: While the results indicate that counterexample-based reasoning is crucial, the experiments lack an ablation study comparing different types of proof strategies (e.g. direct proof, proof by contradiction, proof by construction). Including such an analysis could strengthen the claim that counterexample-driven reasoning is uniquely impactful.

Other Comments or Suggestions

N/A

Author Response

Response to Reviewer tdTp

Thank you for your efforts in the review; we hope our response will address your concerns!

Supported by out-of-distribution (OOD) evaluations, but additional benchmarks from formal theorem proving could further validate this claim.

To further validate our approach, we expanded our evaluation to include higher-difficulty datasets at the competition level (OlympiadBench-Math, AIME 2024) and university level (MMLU-College-Math). The accuracy results are summarized below:

| Model | GSM8K | MATH | OlympiadBench-Math | MMLU-College-Math | AIME 2024 |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-Instruct-7B | 95.1 | 80.5 | 41.6 | 74.0 | 20.0 |
| Our SFT Model | 95.6 | 87.9 | 46.4 | 80.0 | 30.0 |

These results show significant improvements on both simpler and more complex OOD datasets, supporting our hypothesis that learning from counterexamples enhances the mathematical reasoning ability of current LLMs.

While the results indicate that counterexample-based reasoning is crucial, the experiments lack an ablation study comparing different types of proof strategies (e.g., direct proof, proof by contradiction, proof by construction). Including such an analysis could strengthen the claim that counterexample-driven reasoning is uniquely impactful.

We appreciate this insightful suggestion. In our work, we conducted a preliminary analysis using different prompting strategies (CoT, ICL, and Hint) to assess their impact on example generation. Due to space limitations, we provided only a brief overview. We plan to conduct a more detailed ablation study in future work and hope our current analysis will encourage further research in this area.

Have you considered applying COUNTERMATH to a Lean-based proof assistant (or Isabelle, Coq, etc.) to test whether counterexample-driven learning aids in formal symbolic reasoning?

Our decision not to use formal tools stems from a mismatch with COUNTERMATH's natural language rationale annotations. Formal data generated by automated tools relies heavily on autoformalization models, which currently show inconsistent performance on datasets like GSM8K and MATH. However, we agree that exploring formal generation of counterexamples is a promising direction for future research.

How does the model perform on longer proofs (like theorems in research-level math papers) that require lots of counterexamples? Is reasoning depth a bottleneck?

Our benchmark is derived from university-level mathematics textbooks, covering key definitions, theorems, properties, and axioms. In many cases, the model incrementally builds its reasoning by adding constraints and refining assumptions. While this demonstrates some reasoning depth, research-level theorems—requiring extensive counterexamples and multi-step proofs—pose a significant challenge. We hypothesize that tackling such complex proofs requires the LLM to possess multiple capabilities, and reasoning depth does emerge as a bottleneck for LLMs in these scenarios.

Moreover, as seen in models like o1 and Deepseek-R1, an excessively prolonged reasoning process can lead to a phenomenon where the model "overthinks" and becomes trapped in a loop of iterative self-correction. This results in reasoning fallacies, where the model repeatedly attempts to refine its initial claim but ultimately diverges further from a correct solution. We encountered this specific issue during our evaluation of Deepseek-R1 on the COUNTERMATH benchmark, which resulted in more than 500 seconds of thinking time per question.

Addressing these research-level reasoning challenges requires a multi-faceted approach. Further model optimization like SFT and Reinforcement Learning can improve the model’s understanding of mathematical concepts, its proficiency in using formal languages for structured reasoning, and its ability to generate logically consistent counterexamples. We believe that continued advancements in these areas will significantly enhance LLMs' performance on complex mathematical proofs.

Thank you once again for your valuable questions, which really encourage us to further improve our research.

Reviewer Comment

Thanks for the authors' detailed reply. I will keep my rating of 4.

Author Comment

We are very glad that our reply addresses your concerns, and your encouragement is a great motivation for us. We believe that our work can provide new perspectives for the mathematical LLMs community and highlight the key role of counterexample-driven reasoning in improving the mathematical ability of LLMs.

Review
Rating: 1

The paper targets solving mathematical problems with LLMs, and focuses specifically on proofs by counterexample. The authors introduce a novel benchmark: CounterMATH. It contains 1216 examples of mathematical statements that are (dis)proved via showing a counterexample. The pipeline for creating the dataset is described. Multiple LLMs are evaluated on the benchmark, showing that it is challenging. Additionally, the authors extract counterexample-based proofs from existing benchmarks to finetune Qwen2.5-Math-7B-Instruct and show improved performance of the finetuned model on the MATH and GSM8K datasets.

Update after rebuttal

I decided to keep my score unchanged. I think counterexample-style reasoning is definitely an interesting topic, but I also think that the paper in its current shape should not be accepted. The writing is imprecise and has too much of a sales-pitch aesthetic (for instance, the false dichotomy between "drill-based reasoning", i.e., direct proofs, and supposedly superior counterexample-based reasoning). There are also other problems, like the validity of the "hint prompting". I suggest the authors make the paper more solid and resubmit.

Questions for Authors

  1. In Table 1, some models have higher F1 but lower Examples (%) -- for instance, Qwen2.5-Math-7B-Instruct vs OpenAI o1-preview. As all the examples in the benchmark require a proof by counterexample, I find this strange. Could you explain it? Could you show some concrete examples where the model provided a good answer but didn't use counterexamples?
  2. Did you perform inspection of the quality of the counterexamples produced by some LLMs? Could you show some LLM-produced counterexamples which do not appear in CounterMATH? Or counterexamples which are incorrect?

Claims and Evidence

I find two main claims in the paper: (1) proving with counterexamples is difficult for LLMs, (2) fine-tuning LLMs on counterexample-based proofs helps improve their "reasoning skills" in general.

Claim (1) is supported by strong evidence with experiments with multiple LLMs evaluated on CounterMATH.

Claim (2) has rather weak experimental evidence: the authors fine-tune only one model and evaluate it on just two math benchmarks. Additionally, on one of them, GSM8K, the improvement is small (95.1% vs 95.6%).

Methods and Evaluation Criteria

In general, the methodology seems to be sound. However, I'm not sure about this aspect of the pipeline:

When you evaluate LLMs on CounterMATH, you say that "we adopt a Hint prompt, designed to encourage LLMs to provide examples when attempting to solve problems". But when you ask about whether a math statement is true or false and at the same time say that a proof by counterexample is expected, this means that in many (all?) cases one can infer the truth value without any reasoning.

If we have a math statement X, typically only one of the two alternatives, X, neg(X), can be proved by showing a counterexample.

For instance, if one asks me whether this is true

"Every subgroup $H$ of an abelian group $G$ is a normal subgroup."

and I know that I should use a proof by counterexample, I know that the statement must be false.

Could you comment on that?

Theoretical Claims

There are no theoretical claims in the paper.

Experimental Design and Analysis

See above (the issue with the Hint prompt).

Supplementary Material

The authors supplement some code and data. However, the code is not documented and I was unable to verify whether it is functional and what it is for. The attached data contain only 30 examples from the full CounterMATH dataset. Each example consists of a statement (in English and Chinese), a rationale (which is a proof by counterexample; also in the two languages), and a binary label. I'm not sure how to interpret the meaning of the binary label. Perhaps it indicates whether the statement is true; for instance

"statement": "The smallest number ring containing $\\sqrt{3}$ is the real number ring."

is labeled as false. However, some statements cannot have a binary truth value, e.g.,

"statement": "Examples where the $n(n \\geqslant 3)$-ary associative law does not hold",

so I'm not sure about the quality of the dataset.

Relation to Existing Literature

The paper proposes a new mathematical benchmark / dataset and is therefore related to multiple other recent papers proposing similar math benchmarks. The benchmark proposed by the authors has a clear distinguishing feature, namely its focus on proofs by counterexample, and I'm not aware of another benchmark focusing on this aspect.

Missing Essential References

The authors in general discuss relevant related literature.

Other Strengths and Weaknesses

Strengths

  1. The authors target an interesting and challenging problem: counterexample-based proving.
  2. The authors evaluate many (25) different LLMs on their benchmark.
  3. The authors provide some interesting analyses beyond assessing accuracy, like number of tokens produced.

Weaknesses

  1. The authors did not disclose the CounterMATH benchmark, which would allow assessing its quality.
  2. The writing is quite poor or imprecise in some places. For example: "Conceptual Finetuning" -- why such a phrase for finetuning on counterexample-based proofs? Or "statement is True or False by its rationale" -- what does it mean?
  3. The authors present a narrative that the counterexample-based reasoning constitutes more important, deeper form of reasoning; they write that learning from non-counterexample proofs is "drill-based" learning, and that "example-based learning is a more important strategy than drill-based learning." Although I agree that counterexample-based proving is important and challenging, such claims are completely unjustified and false.

Other Comments or Suggestions

Figure 5 is difficult to read -- a simple table or a bar plot would be better in my opinion.

Author Response

Response to Reviewer vkWZ

Claim (2) has weak experimental evidence: only one model fine-tuned, evaluated on two math benchmarks.

Our primary goal is to introduce a benchmark and explore LLMs' math capability in providing counterexamples. We observed that even state-of-the-art LLMs struggle with this task. Therefore, we fine-tune a representative model, Qwen2.5-Math-Instruct, to enhance its counterexample reasoning abilities. To strengthen our claim, we expanded our evaluation to more challenging datasets, including competition-level (OlympiadBench-Math, AIME 2024) and university-level (MMLU-College-Math) benchmarks.

| Model | GSM8K | MATH | OlympiadBench-Math | MMLU-College-Math | AIME 2024 |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-Instruct-7B | 95.1 | 80.5 | 41.6 | 74.0 | 20.0 |
| Our SFT Model | 95.6 | 87.9 | 46.4 | 80.0 | 30.0 |

These results demonstrate significant improvements on both simpler and complex datasets, supporting our hypothesis that learning from counterexamples enhances performance.

Concern about "Hint prompt".

We observed that directly prompting LLMs for true/false judgments led to poor performance and random guessing. Therefore, we use the Hint prompt to encourage LLMs to "give examples" rather than assume the truth value. It guides the LLM toward a more reasonable way of proving, namely reasoning by giving examples.

Code documentation and dataset quality.

Thank you for your meticulous review. Regarding the second statement you mentioned, we classify True/False based on the existence of examples: False if no such examples exist, and True if at least one example exists; framing these with binary labels enables more efficient evaluation. As for dataset quality, COUNTERMATH was initially sourced from textbooks and verified by experts to ensure all statement-rationale pairs are unambiguous and reasonably categorizable, as detailed in Sections 3.1, 3.2, and the Appendix.
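
For illustration, a single labeled record under this convention might look like the minimal sketch below; the field names (`statement`, `rationale`, `judgement`) follow the fields described in this thread, but the exact schema is illustrative rather than the released format.

```python
# Illustrative sketch only: field names and exact schema are assumptions,
# not the released COUNTERMATH format.
record = {
    "statement": "The smallest number ring containing $\\sqrt{3}$ is the real number ring.",
    "rationale": "The minimal ring containing $\\sqrt{3}$ is $\\mathbb{Z}[\\sqrt{3}]$, "
                 "a proper subring of the reals, so the statement is false.",
    "judgement": False,  # the statement is false; see the rationale
}
```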

Lack of benchmark disclosure for quality assessment.

Sorry for only providing the sampled dataset during the anonymous period. The main reason is that we plan to set up a public leaderboard, so it is not appropriate to disclose all the data yet. In addition, providing sampled datasets during the anonymous review stage does not violate any rules (in fact, this is a common practice, which protects the data from premature leakage while still allowing reviewers to inspect samples).

Writing clarity issues.

We apologize for any confusion, and we will ensure clearer phrasing in the final version. Key clarifications:

  • "Conceptual Finetuning" refers to fine-tuning on annotated, complex reasoning data.
  • "Statement is True or False by its rationale" means determining validity based on supporting reasoning.

Claim that counterexample-based reasoning is more important is unjustified.

Sorry for the confusion caused here. We aim to emphasize that drill-based learning alone is insufficient for developing deep mathematical understanding. While drills transfer knowledge, counterexample-based reasoning enhances problem-solving and concept validation. We appreciate your observation and will revise Line 51 to avoid overclaims.

higher F1 but lower Examples (%)

  1. Sometimes the judgment is correct, but the reasoning process is not rigorous.

Example 1:

  • Statement: The smallest number ring containing $\sqrt{3}$ is the real number ring.
  • Answer: False
  • Model Prediction: False (Correct)
  • Model Output: Incorrectly discusses Banach spaces and the RNP instead of relevant number theory concepts.
  • Explanation: The model guessed the correct answer but provided an irrelevant rationale, resulting in a correct judgment but a poor Examples score.
  2. Constructing effective counterexamples requires comprehensive mastery of mathematical knowledge.

Example 2:

  • Statement: All countably compact spaces are compact.
  • Model Output: Proposes an invalid counterexample based on Hilbert cube subsets.
  • Explanation: The Hilbert cube itself is compact, and its closed subsets inherit compactness. Thus, this counterexample is self-defeating.

Inspection of model-generated counterexamples.

Here is an LLM-generated counterexample:

  • Statement: For a fixed element $a$ of the ring $R$, $N = \{ra \mid r \in R\}$ forms an ideal of $R$ but does not necessarily contain $a$.
  • Reference Example: $R$ as the ring of even numbers; $N = \{r \cdot 4 \mid r \in R\}$ is an ideal, but $4 \notin N$.
  • LLM Output: The model analyzed a trivial ring $R = \{0\}$ where the counterexample was consistent with the evaluation criteria.

The LLM’s counterexample differs from the reference but is marked as "consistent" because it satisfies the evaluation criteria. Our judge model successfully extracts and assesses these counterexamples, determining their consistency with the intended logic.

Reviewer Comment

Thank you for the rebuttal and additional experiments; I have some additional questions below.

"Statement is True or False by its rationale" means determining validity based on supporting reasoning.

I still don't understand what you mean by "Statement is True or False by its rationale". A mathematical statement can be either true or false; also, a proof of a true math statement can be correct or incorrect. So is a true statement with an incorrect proof "a statement false by its rationale"? Please explain it more, and I also strongly suggest changing this sentence to something clearer.

[Hint prompting] guides LLM to prove in a more reasonable way of thinking, which aims to let the model reason by giving examples.

Ok, but do you also agree that the hint prompting reveals the truth value of the statement in question? I think that yes, and it may be problematic in evaluation.

[...] all statement-rationale pairs are unambiguous and reasonably categorizable

Ok, but what about the example I gave, which is not a mathematical statement with a binary truth value? Namely this one:

"statement": "Examples where the $n(n \\geqslant 3)$-ary associative law does not hold",

[...] drill-based learning alone is insufficient for developing deep mathematical understanding.

I still don't know how learning proving theorems in a "direct" way, without counter-examples, is "drill-based learning." I'd say that it is equally difficult and reasoning-intensive as counter-example-based proving.

Author Comment

Thank you for your feedback. Due to the limited length of the rebuttal (only 5000 characters), we regret that we cannot explain some of the details clearly. We appreciate your comments and believe these clarifications will strengthen our paper.

Confusion about "Statement is True or False by its rationale".

We acknowledge the ambiguity in our original writing and appreciate this opportunity for clarification. This sentence appears in parentheses as a supplementary explanation of the notion of "judgement". Specifically, it means that the judgement expresses whether the statement is true or false, and this judgement is supported by the corresponding rationale. The data we construct follow this format, and they have been rigorously screened and verified to ensure the correctness of each rationale and its judgement.

Question about [Hint prompting]: Do you also agree that the hint prompting reveals the truth value of the statement in question?

Our hint prompt (see Appendix B, p. 13) only prompts the model to give examples:

Please reason by giving examples about whether the above statement is True or False.

This prompt does not disclose the statement's truth value. It simply encourages the model to illustrate its reasoning via examples, including both positive examples (e.g., showing that a certain kind of function exists) and counterexamples (e.g., giving specific examples that violate the constraints of the statement). In other words, we only guide the method of proof (example-based reasoning), not the answer itself.

[...] all statement-rationale pairs are unambiguous and reasonably categorizable. Ok, but what about the example I gave, which is not a mathematical statement with a binary truth value? Namely this one:
"statement": "Examples where the n(ngeqslant3)n(n \\geqslant 3)-ary associative law does not hold".

The example statement "Examples where the $n (n \geqslant 3)$-ary associative law does not hold" does indeed possess a well-defined binary truth value when properly interpreted in the mathematical context. This statement means that there are examples for which the $n (n \geqslant 3)$-ary associative law does not hold. In our data, the statement has a binary truth value: False if no such examples exist, and True if at least one example exists. Our data curation pipeline involved rigorous validation by domain experts to ensure all statements admit unambiguous truth values, align with established mathematical conventions, and maintain a self-contained logical structure. Moreover, from the perspective of model output, the model can also correctly judge the binary value of the statement and output the corresponding binary judgment (true or false).

I still don't know how learning proving theorems in a "direct" way, without counter-examples, is "drill-based learning." I'd say that it is equally difficult and reasoning-intensive as counter-example-based proving.

We agree that direct theorem proving is challenging and requires strong reasoning skills. When we refer to "drill-based learning," we are describing the common practice seen in most math textbooks and in the large-scale training materials of recent math LLMs, where proofs follow a step-by-step process, starting from basic assumptions and logically building up to conclusions. While this approach works well for teaching standard methods, it might unintentionally encourage LLMs to focus too much on memorizing common proof patterns rather than truly understanding the logic behind them. For example, models trained this way might handle familiar problems smoothly but struggle when facing slightly modified questions that require adapting core principles in new ways.

Counterexample-based reasoning takes a different path. Instead of building up proofs from assumptions, it starts by questioning the conclusion and tries to find specific examples that break the original statement's rules. Imagine a student proving a math claim is false by constructing a simple example that violates it—this method focuses on analyzing constraints and creatively testing edge cases. It challenges models to actively explore the boundaries of mathematical rules rather than mechanically follow predefined proof steps. This contrast helps us better distinguish between models that merely copy textbook solutions and those that genuinely grasp when and why certain mathematical principles apply.

Review
Rating: 2

The paper presents a new mathematical benchmark which tests the counterexample-based proof-generation ability of LLMs across several sub-areas of math. The overall experimental results show that today's LLMs generally have low scores when trying to solve these problems with counterexample-based reasoning. Further, the authors experiment with fine-tuning a model to improve the scores; however, the results are borderline.

Questions for Authors

I already mentioned a few questions earlier.

Claims and Evidence

The claims are well supported by the paper.

Methods and Evaluation Criteria

Yes, the proposed benchmark makes sense. However, it would have been good if the authors had shown a few examples of exactly what the data points in the benchmark look like. I looked at the supplementary material, and that was lacking there as well.

Theoretical Claims

This is an experimental paper, no theoretical contribution.

Experimental Design and Analysis

Yes, the experiments overall make sense.

Supplementary Material

Yes, all parts.

Relation to Existing Literature

Measuring the performance of language models on mathematical tasks is a very active area of research. I have not yet seen a comprehensive benchmark which tests the LLMs ability on example-based proofs, so this is a useful contribution.

Missing Essential References

It would have been good if the authors had discussed contamination a bit more. Given the problems they select and the reference solutions, perhaps a contamination test on the base models would have been worthwhile?

Other Strengths and Weaknesses

I believe this is a solid contribution, although the technical depth is not high. Couple of questions:

  • did the authors look at EN-based textbooks for such proofs?
  • did the authors consider generating a counterexample-based dataset using formal tools? These tools are generally good at producing counterexamples to conjectures when the problem is encoded.
  • it was not entirely clear, especially given the low scores, whether the authors considered simpler competitions / problems that can be solved with examples. They would also be more amenable to generating data with formal provers (a simple instance would be anything SAT/SMT encodable).
  • The results with the fine-tuning are somewhat mixed. Do you think that doing RL instead of standard SFT training would improve the results? It would have been good to see such an experiment.
  • It would have been nice to see a bit more of an evaluation of where the models fail, e.g., is it due to wrong reasoning steps, miscalculation errors, or deeper issues when deriving the example?

Other Comments or Suggestions

I think it would be very useful if the authors, early on, provided an end-to-end example (simplified) of a data point in the benchmark dataset and an illustration of the model succeeding and failing, i.e., with a reason. Even when this is explained (rationale, judgement, etc.), it needs more elaboration on what each item is.

Author Response

Response to Reviewer bVLs

We sincerely appreciate your efforts in the review and hope our response will address your concerns.

Did the authors look at EN-based textbooks for such proofs?

Yes. While our annotators are native Chinese speakers and thus did not annotate directly from English textbooks, our core authors aligned the English version of COUNTERMATH with standard EN-based textbooks during quality control. Additionally, many counterexamples in our Chinese sources originate from English works like Counterexamples in Topology (Steen & Seebach) and Counterexamples in Topological Vector Spaces (Khaleelulla). We are confident that our dataset maintains cross-language representativeness.

Did the authors consider generating a counterexample-based dataset using formal tools?

We chose not to use formal tools because the resulting data would conflict with the natural language rationale annotations in COUNTERMATH. Using formal tools risks misalignment between generated counterexamples and the human-provided explanations, and their quality depends heavily on the autoformalization model—which, given current performance on datasets like GSM8K and MATH, remains inconsistent. However, we agree that formal generation of counterexamples is an intriguing direction for future work.

It was not clear if the authors considered simpler problems that can be solved with examples, which would be more amenable to formal provers (e.g., SAT/SMT encodable instances).

We have indeed considered simpler math datasets such as GSM8K and MATH. Our experimental results (Table 3) show that counterexample-driven reasoning yields significant improvements on these elementary math scenarios—demonstrating that even 7B-scale models can outperform 72B-scale models when leveraging examples effectively.

Do you think that using RL instead of standard SFT training would yield better results?

We share your view on the potential of RL. Unfortunately, our current GPU resources (only 2 L20 48G GPUs) limited our ability to conduct effective RL experiments for LLMs. Moreover, our primary goal was to emphasize counterexample-driven reasoning and introduce COUNTERMATH. We believe that further exploration—including RL-based training—remains a promising direction for future research.

It would have been nice to see a bit more of an evaluation of where the models fail, e.g., due to reasoning errors, miscalculations, or deeper issues?

This is a very interesting and important research direction. We did attempt to evaluate the reasoning process of the model in order to identify the precise step where errors begin. However, the complexity of the prompts led to instability in the output of the judge model. For example:

Statement: There exists a divergent series, which can be rearranged to arbitrarily slow down its divergence.
Model Output: Correct understanding of divergent series, but mistakenly applies properties of conditionally convergent series. "First, let's recall that a divergent series is a series whose sequence of partial sums does not converge to a finite limit... it is possible to construct a divergent series whose divergence can be arbitrarily slowed down by rearranging its terms... Therefore, it is indeed possible to rearrange a divergent series..."

Human Judgment: False
LLM Judgment: True

Given these inconsistencies, we used the Example evaluation approach. Future research will explore stepwise error analysis, like using reward models to more systematically analyze the causes of the model's failures.

Provide an end-to-end example to clarify benchmark components and model performance.

We appreciate the reviewer’s suggestion and have refined the examples to better illustrate our benchmark’s utility and model performance, highlighting model successes and failures:

Example 1

  • Statement: "The smallest number ring containing 3\sqrt{3} is the real number ring."
  • Answer: False
  • Model Prediction: False (Correct)
  • Model Output: "To determine whether Lp[0,1]L^{p}[0,1] and lpl^{p} are linearly isomorphic for p2p \neq 2..."
  • Correct Rationale: "The minimal ring containing 3\sqrt{3} is Z[3]\mathbb{Z}[\sqrt{3}]."
  • Model Issue: The model correctly answered "False" but hallucinated irrelevant concepts (e.g., Banach spaces) unrelated to ring theory.

Example 2

  • Statement: "There exist rings without identity elements, and there are cases with left or right identity but not both."
  • Answer: True
  • Model Prediction: False (Incorrect)
  • Model Output: "We aim to determine whether there exists a bilinear functional φ(x,y)\varphi(x, y)..."
  • Correct Rationale: "A finite 4-element ring can have left identities without right identities."
  • Model Issue: The model entirely misinterpreted the problem type, discussing bilinear functionals instead of ring-theoretic identities.

Review
Rating: 2

The paper introduces a benchmark called COUNTERMATH. It is designed to assess the ability of LLMs to reason about mathematical statements and justify them using counterexamples. The dataset comprises statement-rationale pairs, sourced from math textbooks, and undergoes manual filtering to ensure quality. The evaluation framework includes F1 for assessing the correctness of model judgments on math statements and introduces three new metrics: (a) proportion of examples, (b) strict alignment, (c) loose alignment. The three metrics are used to measure how effectively models use counterexamples in their reasoning.

The paper presents a data engineering framework for obtaining counterexample-based math proofs for fine-tuning. This framework leverages GPT-4o to filter and refine proofs to ensure that the training data aligns with the conceptual reasoning requirements of COUNTERMATH.

The paper evaluates a range of open-weight and proprietary models on COUNTERMATH, revealing that open-weight models have significantly lower performance than proprietary models. The proprietary models still perform poorly compared to benchmarks focused on simpler mathematical problems. Furthermore, the results demonstrate that fine-tuning on counterexample-based math proofs, combined with hint-based prompting, leads to an improvement in F1 in COUNTERMATH, outperforming the base model.

Update after rebuttal:

I will maintain my score of 2. The model fine-tuned on counterexample-based math proofs shows only marginal improvements over the model fine-tuned on non-counterexample proofs, and this doesn't sufficiently demonstrate that learning from counterexamples is valuable for improving the model's math abilities.

Questions for Authors

  1. When evaluating the consistency of model-generated examples against reference examples, will a generated example be marked as consistent if it conveys the same meaning as at least one of the reference examples, or only if it matches all of them when there are multiple reference examples?
  2. In Table 3, are the numbers showing the F1 scores?

Claims and Evidence

While the paper claims that counterexample-driven learning improved performance of the fine-tuned model on COUNTERMATH, my concern is that the observed gains might not be attributable to the use of counterexample-based datasets. The fine-tuning process involves exposing the model to 1025 high-quality proofs using counterexamples. Without experiments using a similar number of math proofs with similar difficulties that are not structured around counterexamples, it is difficult to determine whether the performance boost arises specifically from counterexample-driven learning or from increased exposure to more material.

Methods and Evaluation Criteria

Overall, the proposed benchmark makes sense. However, in the metrics for example alignment in COUNTERMATH, if a model generates valid counterexamples that differ from the provided reference, the evaluation might mark them as inconsistent based on the current evaluation prompt. This could penalize correct reasoning simply because the output diverges from the reference form, even though the counterexamples are mathematically sound. Moreover, the reliability of this automated evaluation of example alignment remains questionable, given the reported accuracy of 93.5% for GPT-4o at evaluating alignment relative to human judgments.

Theoretical Claims

The paper is primarily empirical and does not present formal proofs.

Experimental Design and Analysis

The experiments are thoughtfully constructed. Various open-weights LLMs and proprietary models are evaluated on COUNTERMATH with CoT prompts. Qwen-2.5-Math-7B-Instruct is fine-tuned on the counterexample-based math proofs, and is evaluated on both COUNTERMATH dataset and out-of-distribution datasets like MATH from (Hendrycks et al., 2021) and GSM8K from (Cobbe et al., 2021). Both the open-weights models and proprietary models achieve low F1, showing the difficulty of this dataset. Although the fine-tuned model achieves higher F1 on COUNTERMATH, the original base model outperforms the fine-tuned model on Strict/Loose Align metrics. This might suggest that by fine-tuning on the new counterexample-based dataset, counterexamples generated don’t align well with the counterexamples in the reference solution. The authors explain this drop in performance with the discrepancies of the dataset used for fine-tuning and COUNTERMATH. More detailed analysis should be provided to determine if the drop in alignment with the reference counterexamples causes an issue in the reasoning process or not.

Supplementary Material

I’ve reviewed the samples of dataset and the prompts.

Relation to Existing Literature

This paper focuses on counterexample based benchmark, which is claimed to better reflect understanding of math concepts than previous benchmarks that focus on solving simple math problems. By fine-tuning on counterexample-based math proofs, the model shows better performance on the proposed benchmark and out-of-distribution tasks compared to the base model. The results also show that the current open-weights LLMs struggle with utilizing counterexamples to reason, and perform badly in topics like topology and real analysis.

Missing Essential References

The paper offers insights into counterexample based reasoning through natural language proofs, but it overlooks formal theorem provers like LeanDojo from (Yang et al., 2023) that employs programming languages like Lean4 (Moura & Ullrich, 2021). Given that formal proofs benefit from more objective evaluation, a discussion on the choice to focus on natural language proofs, and the challenges associated with their evaluation would help contextualize the paper’s contributions.

Other Strengths and Weaknesses

The counterexample based reasoning benchmark is a resource for evaluating the ability of LLMs to understand complex math concepts. The paper shows improvements on F1 score for proving statements in COUNTERMATH and improvements on the ability to use counterexamples in proofs by fine-tuning on counterexample-based datasets.

Some weaknesses:

  1. No comparison with a model fine-tuned on a dataset of non-counterexample-based math proofs makes it unclear whether improvements are due to counterexample-based fine-tuning or simply increased exposure to advanced material.
  2. The current evaluation prompt may overly penalize valid and mathematically sound counterexamples that differ from the reference, and mark them as inconsistent.
  3. Using GPT-4o for evaluating alignment with a reported 93.5% accuracy relative to human judgment raises concerns about potential misclassifications.
  4. Although the paper shows there is a gain in performance on the out-of-distribution datasets by fine-tuning on counterexample proofs, the OOD datasets used are described as high school competition level and grade school level. I am concerned whether this improvement generalizes to more challenging OOD datasets.
  5. Training settings are not described (even in the Appendix).

Other Comments or Suggestions

Some additional comments:

  1. The paper claims that although the o1 model achieves the best F1 score on COUNTERMATH, its performance is still significantly lower than what it achieved on elementary and high school mathematics benchmarks. It would be helpful to add the performance of the o1 model on simpler benchmarks for readers to see the difference.

  2. The Hint prompt is only used with Qwen2.5-Math-7B-Instruct but not with any other models. It would be helpful to see how the Hint prompt influences the performance of different models on the COUNTERMATH dataset and metrics.

Author Response

Response to Reviewer KrNF

No comparison with a model fine-tuned on a dataset of non-counterexample-based math proofs.

We agree that such a comparison would further validate our approach. Since our primary motivation was to explore how LLMs use counterexamples in proofs, we initially evaluated only counterexample-based fine-tuning. To address your suggestion, we randomly selected 1,025 non-counterexample samples from the NaturalProofs dataset (the same source as our counterexample data) and fine-tuned Qwen2.5-Math-Instruct under identical settings. Results are as follows:

| Model | Overall F-1 | Examples (%) |
| --- | --- | --- |
| Base Model | 38.3 | 74.2 |
| Counterexample SFT Model | 39.7 | 75.2 |
| Non-counterexample SFT Model | 37.9 | 74.6 |
| Base Model (Hint prompt) | 39.4 | 79.0 |
| Counterexample SFT Model (Hint) | 41.1 | 79.4 |
| Non-counterexample SFT Model (Hint) | 39.9 | 78.4 |

The superior performance of counterexample-based models supports our claim that improvements stem from learning counterexamples rather than increased exposure to advanced data.

The current evaluation prompt may overly penalize valid and mathematically sound counterexamples.

Our evaluation prioritizes whether generated examples lead to the same final conclusion as the reference. While the examples need not match exactly, they are marked as “consistent” if they function equivalently in proving the statement.

Concerns about GPT-4o's evaluation.

Given the current challenges of human annotation in terms of cost and efficiency, our approach aligns with prior works on GPT-4o evaluation (e.g., DPO [1], SimPO [2]) and established benchmarks like MT-Bench [3] and Alpaca-Eval [4]. To enhance accuracy, we provided reference information to GPT-4o, and our human-validated results further confirm the approach's reliability.

[1] Direct Preference Optimization: NeurIPS 2023
[2] SimPO: Simple Preference Optimization: NeurIPS 2024
[3] MT-Bench: NeurIPS 2023
[4] AlpacaEval-LC: COLM 2024

Concerns about the generalizability of SFT model to more challenging OOD datasets.

We extended evaluations to competition-level (OlympiadBench-Math, AIME 2024) and college-level (MMLU-College-Math) datasets:

| Model | GSM8K | MATH | OlympiadBench-Math | MMLU-College-Math | AIME 2024 |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-Math-Instruct-7B | 95.1 | 80.5 | 41.6 | 74.0 | 20.0 |
| Our SFT Model | 95.6 | 87.9 | 46.4 | 80.0 | 30.0 |

Our SFT model shows significant improvements across both simpler and more complex OOD datasets, supporting our hypothesis that learning from counterexamples enhances performance over a range of difficulty levels.

Lack of training details.

We apologize for the oversight. Our training used Qwen2.5-Math-Instruct on two L20 48G GPUs with LoRA (rank 16). The learning rate was 1.0e-4 for one epoch. Full details will be clarified in the final version.
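
For reference, a rough sketch of this setup is shown below (LoRA rank 16, learning rate 1.0e-4, one epoch on Qwen2.5-Math-7B-Instruct); the library choice (peft/trl), target modules, batch size, and the toy dataset are illustrative assumptions rather than our exact training script.

```python
# Hedged sketch of the reported fine-tuning setup; values marked "reported"
# come from the response above, everything else is assumed.
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Placeholder rows standing in for the ~1,025 counterexample-based proofs.
train_dataset = Dataset.from_list([
    {"text": "Statement: ...\nProof by counterexample: ..."},
])

lora_config = LoraConfig(
    r=16,                                                     # reported
    lora_alpha=32,                                            # assumed
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="countermath-sft",
    learning_rate=1.0e-4,             # reported
    num_train_epochs=1,               # reported
    per_device_train_batch_size=2,    # assumed; constrained by 2x L20 48G GPUs
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-Math-7B-Instruct",
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```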

Suggestion to add performance of o1 model on simpler benchmarks.

Below is a comparison using official o1-preview results (due to resource constraints):

| Model | Countermath | AIME 2024 | MMLU | MATH-500 |
| --- | --- | --- | --- | --- |
| F-1 Example | 60.1 | 39.8 | 74.4 | 90.8 |
| Strict | 55.8 | 39.8 | 74.4 | 85.5 |

The gap between COUNTERMATH and simpler benchmarks underscores the challenge of counterexample-based reasoning.

Hint prompt evaluations limited to Qwen2.5-Math-7B-Instruct.

We have also tested the Hint prompt with the Deepseek-Math-7B-rl model. Although space constraints prevented us from including these results in the paper, the summary is as follows:

| Model | Overall F-1 | Examples (%) | Strict (%) | Loose (%) |
| --- | --- | --- | --- | --- |
| Deepseek-Math-7B-rl | 79.1 | 65.9 | 18.9 | 20.6 |
| Deepseek-Math-7B-rl + Hint | 78.4 | 83.5 | 25.6 | 27.7 |

The Hint prompt notably increases the "Examples (%)" metric, demonstrating its positive impact on example generation. Due to consistent trends observed between Qwen2.5-Math-7B-Instruct and Deepseek-Math-7B-rl, we limited further evaluations to these two models.

Evaluation "consistency" of examples when facing multiple examples.

A generated example is considered "consistent" if it effectively serves as a valid proof for the statement—matching the intended meaning and function of at least one reference example. So even if the example is different from the ones we provided, it’s still considered consistent as long as it has the same meaning.
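
In pseudocode terms, this is simply an "any" over the reference examples; the sketch below is illustrative, and `judge_consistent` stands in for the LLM-based judgement rather than our actual implementation.

```python
from typing import Callable, Sequence

def example_is_consistent(
    generated: str,
    references: Sequence[str],
    judge_consistent: Callable[[str, str], bool],
) -> bool:
    """Consistent if the generated example matches the meaning and function
    of at least one reference example."""
    return any(judge_consistent(generated, ref) for ref in references)

# Toy usage with a placeholder judge (the real check is done by an LLM judge).
print(example_is_consistent(
    "Take the trivial ring R = {0}: N = {ra : r in R} = {0} contains a = 0.",
    ["R the ring of even integers with a = 4: N = {4r : r in R} is an ideal, but 4 is not in N."],
    judge_consistent=lambda gen, ref: True,
))
```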

Clarification on Table 3 metrics.

We apologize for the confusion. The values represent accuracy, not F1 scores. This will be clarified in the final version.

Review
Rating: 3

The paper studies the capability of LLMs in providing counterexamples for mathematical proofs. Specifically, the paper proposes a benchmark of university-level natural language theorem statements along with their rationale and correctness. The authors evaluate a wide range of LLMs on the F1 score and the counterexample usage frequency. Experimental results suggest the benchmark is quite challenging and that finetuning on counterexample proof samples enhances reasoning capability on standard benchmarks like GSM8K and MATH.

Questions for Authors

  1. On Line 200-202, Evaluation Metrics. "We use lexical matching such as F1 to match the judgments of the statements". What is lexical matching? Also how exactly is F1 calculated here?

  2. Why are strict and loose align meaningful metrics to evaluate? I imagine there could exist many valid counterexamples for a given statement. Both strict and loose align seems to suggest that the model should really provide counterexamples given by textbooks. Is this too restrictive?

  3. Can the authors provide more details on which subject or what type of questions the SFT model improves the most on MATH?

Claims and Evidence

The claims are supported by experimental results.

Methods and Evaluation Criteria

The proposed benchmark is interesting and tackles a unique angle in mathematical reasoning. The evaluation metric of F1 score makes sense for a highly unbalanced dataset. However, the strict and loose matching criteria seem to be not appropriate especially because there might be many correct counterexamples that are different from the counterexamples provided by the textbooks.

Theoretical Claims

N/A.

Experimental Design and Analysis

The experimental design is reasonable and the authors evaluate a wide range of open-source and proprietary LLMs.

Supplementary Material

Yes, I read the prompts and data annotation details. I also took a look at the samples in the zip file.

Relation to Existing Literature

The contribution of the paper is relevant and timely.

Missing Essential References

N/A.

Other Strengths and Weaknesses

One strong experimental result is that the authors show that by finetuning on counterexample training data, the model improves significantly on MATH, which is very encouraging.

Other Comments or Suggestions

Please see questions below.

Author Response

Response to Reviewer H7Ry

Thank you for your efforts in the review; we hope our response will address your concerns.

On Line 200-202, Evaluation Metrics. "We use lexical matching such as F1 to match the judgments of the statements." What is lexical matching? Also, how exactly is F1 calculated here?

We apologize for the ambiguity. "Lexical matching" refers to extracting the model's final judgment (True or False) using regular expressions. Specifically, we parse the model's output to identify the final decision and map "False," "True," and empty responses to numerical values (0, 1, and 2, respectively). The F1-score is then computed as a macro-averaged multiclass F1, where each class's F1-score is calculated as: F1 = 2 * (precision * recall) / (precision + recall). We implement this using the evaluate library's F1 function, treating the empty response as a separate class. The final reported score is the macro-averaged F1 across all classes.
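
Concretely, the scoring can be sketched as below; the regular expression and the toy data are illustrative, while the label mapping (False to 0, True to 1, empty to 2) and the macro-averaged F1 via the evaluate library follow the description above.

```python
# Sketch of the F1 scoring described above; the regex and fallback handling
# are illustrative assumptions.
import re
import evaluate

LABELS = {"false": 0, "true": 1}
EMPTY = 2  # an output with no judgement is treated as its own class

def extract_judgement(model_output: str) -> int:
    matches = re.findall(r"\b(true|false)\b", model_output, flags=re.IGNORECASE)
    return LABELS[matches[-1].lower()] if matches else EMPTY

# Toy stand-ins for model outputs and gold judgements.
model_outputs = ["... hence the statement is False.", "Therefore the statement is True."]
gold_labels = [0, 1]

f1_metric = evaluate.load("f1")
predictions = [extract_judgement(o) for o in model_outputs]
score = f1_metric.compute(predictions=predictions, references=gold_labels, average="macro")
print(score)  # e.g. {'f1': 1.0}
```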

Why are strict and loose align meaningful metrics to evaluate? I imagine there could exist many valid counterexamples for a given statement. Both strict and loose align seem to suggest that the model should really provide counterexamples given by textbooks. Is this too restrictive?

Our evaluation aims to assess whether the model can generate examples that support its reasoning. During evaluation, we use GPT-4o to verify whether the model-generated examples lead to the same final conclusion as reference examples. This approach allows for flexibility: the generated examples do not need to match the reference ones exactly but should be logically consistent in leading to the conclusion. Thus, our metric captures the model’s reasoning ability rather than enforcing rigid textbook-style counterexamples.
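
As an illustration of this check (not our exact judge prompt, which is provided in the appendix), the consistency call could look roughly as follows; the prompt wording, model name, and answer parsing are assumptions.

```python
# Hypothetical sketch of a GPT-4o-based alignment check; the actual judge
# prompt used in the paper is given in its appendix.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a mathematical example used in a proof.
Statement: {statement}
Reference example: {reference}
Model-generated example: {generated}
Does the generated example lead to the same final conclusion as the reference
(it need not match it exactly)? Answer CONSISTENT or INCONSISTENT."""

def judge_consistent(statement: str, reference: str, generated: str) -> bool:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            statement=statement, reference=reference, generated=generated)}],
        temperature=0,
    )
    verdict = response.choices[0].message.content.upper()
    # Check the negative form first, since "CONSISTENT" is a substring of "INCONSISTENT".
    return "INCONSISTENT" not in verdict
```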

Can the authors provide more details on which subject or what type of questions the SFT model improves the most on MATH?

We conduct a case study on 50 instances where the base model failed but the SFT model succeeded. Since the MATH dataset lacks explicit problem type annotations, we manually annotate and categorize them. The most significant improvements are observed in:

  1. Geometry: Problems requiring length and area calculations. The base model struggles with triangle proportions and spatial reasoning.
  2. Number Theory: Tasks like repeating decimal periods and divisibility checks, where errors stem from incorrect cyclic period calculations or overlooked edge cases.
  3. Algebra: Problems involving averages, fractions, and arithmetic, where the base model frequently miscalculates.

The SFT model demonstrates a better understanding of mathematical concepts, more effective application of formulas, and improves accuracy in calculations. Here is an example:

Question: Sallie earned a grade of exactly 90% based on six equally weighted tests. The four test papers she found have grades of 83%, 96%, 81%, and 82%. What is the sum of the two missing grades?
Correct Answer: 198

  • Base Model: Predicted 208
    • Step: $\frac{83 + 96 + 81 + 82 + x + y}{6} = 90 \Rightarrow 342 + x + y = 540$
    • Error: Miscomputed $540 - 342 = 208$ (the correct value is 198).
  • SFT Model: Predicted 198
    • Step: Same, but correctly computed $540 - 342 = 198$.

Final Decision

The paper was reviewed by five reviewers and received mixed scores. The reviewers generally agree that the paper tackles an interesting and under-explored problem (using LLMs to find counterexamples) and propose a potentially useful benchmark. However, opinions are mixed due to concerns about the evaluation and ablation. The authors addressed many of these issues in the rebuttal with new experiments and clarifications, though some concerns remain. The AC recommends a weak accept.