Putnam-AXIOM: A Functional & Static Benchmark for Measuring Higher Level Mathematical Reasoning in LLMs
Putnam-AXIOM is a challenging mathematical reasoning benchmark for LLMs, revealing significant reasoning performance gaps and the impact of data contamination.
Abstract
Reviews and Discussion
The authors propose a new benchmark, Putnam-AXIOM, with 236 problems taken from the famous William Lowell Putnam Mathematical Competition, and they also include 52 additional problems with functional variations. They run experiments, explore new metrics, and evaluate the performance of SOTA LLMs on their benchmark.
Strengths
- The proposed benchmark consists of problems from a long-running, well-known competition famed for its hardness, so these problems can be potentially useful to the community.
- The authors have carefully chosen the problems and modified them so that LLMs can be evaluated on them.
- The authors have explored various metrics and run a sufficient number of experiments to show that even the largest LLMs perform rather poorly on their benchmark.
- The authors have included another 52 problems with variations and shown that the performance of LLMs drops significantly with even the slightest variations.
Weaknesses
- The given benchmark is too small. I fully acknowledge the hardness of the problems from the Putnam mathematical competition, but a dataset of just 236 problems is too small. For instance, JEEBench [1], which has problems of similar difficulty to those in the proposed dataset, has around 515 problems. So I think there should be at least 500 problems for the dataset to be more comprehensive, or the authors could define broad problem categories and include a sufficient number of problems in each category such that the total number of problems exceeds 500.
- Given that the dataset is the primary contribution of the paper, the authors should include a discussion of the dataset construction process, i.e., how the problems were selected and the percentage of problems selected for each category. For example: a) there are 11 distinct domains defined in the paper; how many problems are present in each of these domains? b) How were the problems selected for each domain? Was there any criterion followed for selecting problems? c) The authors could define easy, medium, and hard categories and further classify the problems accordingly.
- The authors could have experimented with prompting methods other than few-shot prompting, such as Progressive Hint Prompting [2], in which the LLM uses its previously generated answer for each problem as a hint, allowing it to potentially correct its mistakes. Experimenting with such methods would provide crucial insights into whether LLMs can correct their mistakes.
[1]: Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models
[2]: Progressive-Hint Prompting Improves Reasoning in Large Language Models
Questions
- Can you describe the potential limitations and future scope of your work?
- Is the evaluation process automated, or does it require a human with considerable mathematical rigor and expertise to check the LLM responses and verify the answers?
Thank you for your thoughtful review and for highlighting the importance of our work in mathematical reasoning evaluation and benchmark development. For calibration, we'd like to note that ICLR 2025 uses a different rubric than other conferences; for instance, a rating of 8 at ICLR 2025 only indicates a basic "Accept". Furthermore,
- "Accept" is 8 at ICLR 2025 (vs 7 at NeurIPS 2024)
- "Strong Accept" is 10 at ICLR 2025 (vs 9 at NeurIPS 2024)
- ICLR 2025 does not use 7 or 9
We kindly request to take this into account when reviewing our paper.
“Given benchmark is too small, I fully acknowledge the hardness of the problems from Putnam mathematical competition but a dataset of just 236 problems is too small . For instance JEEBench [1] which ...”
We are expanding the dataset (and adding more variations) by incorporating Putnam problems from 1938-1984 using [1][2] beyond the Putnam Archive (1985-2023), effectively doubling its coverage while preserving its rigor and mathematical sophistication. This expansion enables evaluation of LLMs' mathematical reasoning beyond final answers by leveraging our novel TFA-based metric for assessing proof quality, providing deeper insights into models' mathematical capabilities.
[1] Gleason, Andrew M., et al. The William Lowell Putnam Mathematical Competition Problems and Solutions: 1938-1964. Mathematical Association of America, 1980.
[2] Alexanderson, G. L., Klosinski, L. F., and Larson, L. C. The William Lowell Putnam Mathematical Competition: Problems and Solutions, 1965-1984. MAA Problem Books Series. Mathematical Association of America, 1985.
“Given that the dataset is the primary contribution of the paper, authors should include a discussion about the dataset construction. process, i.e, how the problems were selected, % of problems selected for each category ..."
Thank you for pointing out this area of potential confusion; we will elaborate further on this point in our revised submission. We selected problems primarily based on their suitability to yield a final, fixed boxed answer, so as to be compatible with our evaluation method. The domain breakdown is: Combinatorics: 58, Algebra: 61, Geometry: 43, Calculus: 75, Analysis: 73, Number Theory: 95, Probability: 21, Trigonometry: 17, Linear Algebra: 25, Complex Numbers: 10, Differential Equations: 1. Note that in our classification, we allowed questions to be classified under multiple domains. Based on Putnam’s own difficulty rating (1, 2 = ‘easy’; 3, 4 = ‘medium’; 5, 6 = ‘hard’), we had 99 easy, 78 medium, and 59 hard.
“Authors could have experimented with other prompting methods apart from few shot prompting ... ”
Prompting: We are experimenting with different prompting methods. Beyond the few-shot prompt discussed in our paper, we are also testing models with zero-shot prompting and Putnam few-shot prompting. Our Putnam few-shot prompt uses the same prompt and few-shot examples as our few-shot prompt, except we inform the models that they are looking at problems from the Putnam exam.
Currently, we have chosen some of the better-performing open-source models to run these experiments and intend to expand to GPT and Claude for the final paper. The results are shown in the table below. As expected, zero-shot prompting resulted in lower accuracy than few-shot. Interestingly, telling the models they were looking at Putnam problems yielded differing results across models; this is something we intend to investigate further with more models.
| Model | Zero Shot | Few Shot | Putnam Few Shot |
|---|---|---|---|
| Numina AI | 4.23 | 4.66 | 8.05 |
| Qwen2 Math 7B | 4.66 | 5.51 | 5.93 |
| Deepseek Math 7B Base | 2.96 | 5.93 | 5.51 |
| Deepseek Math 7B RL | 6.77 | 8.05 | 5.08 |
We plan on running a suite of other prompting methods (including Progressive Hint Prompting, CoT, and changing the few shot examples to be Putnam problems) on a larger set of models as soon as we finish expanding our dataset.
Comparative analysis across benchmarks: A quick examination shows that Putnam-AXIOM is significantly more challenging than contemporary benchmarks. GPT-4's performance demonstrates this:
- Putnam-AXIOM: 9.32%
- SciBench math: 44.52%
- JEEBench math: 28%
- ARB symbolic math: 18%
“Is the evaluation process automated or does it require a human with considerable amount of mathematical rigour and expertise to sit down and check the responses of LLMs to see if the answers?”
Thank you for this note; we will emphasize this point in our revised paper. One of the key contributions of our benchmark is that it does not require human graders: the entire evaluation process, like MATH, is automated using boxed answers and a built-in equivalence function.
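For concreteness, here is a minimal sketch of what such an automated check could look like; the function names are illustrative, the regex handles only un-nested braces, and the use of sympy for symbolic equivalence is an assumption rather than the paper's exact implementation:

```python
# Minimal sketch of an automated boxed-answer check (illustrative only;
# not the authors' exact implementation).
import re
import sympy

def extract_boxed(response):
    """Return the contents of the last \\boxed{...} in a model response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1] if matches else None

def answers_equivalent(pred, truth):
    """Exact string match first, then a symbolic equivalence check via sympy."""
    if pred is None:
        return False
    if pred.strip() == truth.strip():
        return True
    try:
        return sympy.simplify(sympy.sympify(pred) - sympy.sympify(truth)) == 0
    except (sympy.SympifyError, TypeError):
        return False

response = r"Therefore the answer is \boxed{2/3}."
print(answers_equivalent(extract_boxed(response), "4/6"))  # True: 2/3 equals 4/6
```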
We appreciate your detailed review and thoughtful suggestions. We are pleased to share several substantial improvements we've made to address your concerns:
Dataset Size and Comprehensiveness: Your concern about dataset size was well-taken. We have significantly expanded Putnam-AXIOM:
- Increased the Original benchmark size from 236 to 522 boxed question-answer pairs by incorporating problems from 1938-1984. The updated breakdown for the dataset follows - ‘Combinatorics': 79, 'Algebra': 193, 'Geometry': 125, 'Calculus': 143, 'Analysis': 177, 'Number Theory': 135, 'Probability': 29, 'Trigonometry': 26, 'Linear Algebra': 41, 'Complex Numbers': 29, 'Differential Equations': 20. Note that in our classification, we allowed questions to be classified under multiple domains. Based on Putnam’s own difficulty rating (1, 2 = ‘easy’, 3, 4 = ‘medium’, 5, 6 = ‘hard’), we had 205 easy, 182 medium, and 135 hard.
- Nearly doubled our functional variations from 53 to 100 (representing a ~30% conversion rate to constant + variable changes of our 522 original questions, with ongoing expansion estimating 150 variations)
- This expansion maintains the benchmark's mathematical rigor while providing more comprehensive coverage
| Model | Variation Accuracy | Corresponding Original Accuracy |
|---|---|---|
| Llama-3.1-8b | 3.85 | 6.41 |
| Llama-3.1-8b Instruct | 6.41 | 12.82 |
| Mistral-7B-v0.1 | 2.56 | 7.69 |
| Mistral-7B-Instruct-v0.3 | 6.41 | 6.41 |
| Gemma-7b | 6.41 | 6.41 |
| Gemma-2b | 2.56 | 5.13 |
| Gemma-7b-it | 5.13 | 7.69 |
| Gemma-2b-it | 1.28 | 2.56 |
| Deepseek-Math-7b-Base | 2.56 | 6.41 |
| Deepseek-Math-7b-Rl | 11.54 | 12.82 |
| Deepseek-Math-7b-Instruct | 8.97 | 11.54 |
| NuminaMath-7B-TIR | 12.82 | 12.82 |
| Qwen2-Math-7B | 11.54 | 15.38 |
| Qwen2-Math-7B-Instruct | 5.13 | 11.54 |
We have further expanded our evaluations across more prompting techniques. In our Putnam Context prompt, we include in the prompt that our problems are from the Putnam Competition.
The following accuracies are on the new dataset with 522 questions:
| Model | Zero Shot | Few Shot | Putnam Context + Few Shot | Few Shot COT | Zero Shot Tree of Thought | Few Shot Tree of Thought |
|---|---|---|---|---|---|---|
| Llama-3.1-8b | 1.53 | 3.25 | 4.78 | 3.25 | 2.87 | 4.02 |
| Llama-3.1-8b Instruct | 6.13 | 5.75 | 5.75 | 5.75 | 8.43 | 5.17 |
| Mistral-7B-v0.1 | 9.67 | 3.64 | 4.41 | 3.83 | 0.57 | 2.68 |
| Mistral-7B-Instruct-v0.3 | 2.11 | 4.02 | 4.02 | 4.21 | 1.91 | 4.59 |
| Gemma-7b | 1.34 | 4.60 | 4.41 | 4.21 | 0.19 | 4.02 |
| Gemma-2b | 0.00 | 2.87 | 4.60 | 3.26 | 0.38 | 3.44 |
| Gemma-7b-it | 4.41 | 3.07 | 3.64 | 3.45 | 1.72 | 3.44 |
| Gemma-2b-it | 0.96 | 0.95 | 0.57 | 1.34 | 1.15 | 0.57 |
| Deepseek-Math-7b-Base | 2.87 | 4.02 | 3.83 | 5.75 | 2.11 | 3.64 |
| Deepseek-Math-7b-Rl | 8.04 | 8.62 | 8.23 | 9.96 | 9.00 | 9.96 |
| Deepseek-Math-7b-Instruct | 6.13 | 6.89 | 6.70 | 6.32 | 7.10 | 8.04 |
| NuminaMath-7B-TIR | 9.38 | 10.34 | 8.43 | 8.24 | 9.38 | 9.77 |
| Qwen2-Math-7B | 8.81 | 9.57 | 10.54 | 9.77 | 6.70 | 7.85 |
| Qwen2-Math-7B-Instruct | 18.2 | 11.49 | 12.26 | 13.4 | 19.92 | 11.88 |
| GPT-4 | - | 11.30 | - | - | - | - |
| GPT-4o | - | 19.35 | - | - | - | - |
The paper proposes a new benchmark to evaluate the reasoning capabilities of language models. The benchmark is generated by taking original Putnam problems from the Putnam Mathematical Competition and adapting them so that they are well suited for automated verification of proposed solutions. Furthermore, the problems are parametrized, meaning that new problems can be generated continuously. These two features set the proposed benchmark apart from existing reasoning benchmarks for LLMs.
Strengths
The paper tackles the important problem of creating a benchmark for which solutions can be verified in an automated fashion (avoiding expensive human labor) and for which training-set contamination can be addressed. That is, new problems can be generated ad infinitum simply by changing the parametrization of the problem.
Under the premise that LLMs can never see all possible problems, this benchmark poses an interesting challenge to the reasoning capabilities of LLMs. As such, it has the potential to play an important role in the further development of new language model architectures.
Weaknesses
- The paper is a little bit short on how new problems are generated.
Questions
- How important are the two functional changes that can be made to the problems, i.e., variable change and constant change? Intuitively one would say constant change. Have you investigated this?
- When would you consider the benchmark to be solved?
- How do you generate/sample new problems? And how do you make sure that you sample problems uniformly from the problem space? It might be that all the problems you sample fall into the same region and that the LLM could pick up on the statistics of this once there is training data contamination. This ties into the question of when you would consider the benchmark solved, as a different sampling strategy might again lead to degrading performance.
- Is it correct to say that if an LLM can reason it will be able to solve the benchmark, but the inverse is not true? I.e., if it solves the benchmark it might not be performing reasoning.
Details of Ethics Concerns
NA
Thank you for your thoughtful review and for highlighting the importance of our work in mathematical reasoning evaluation and benchmark development. For calibration, we'd like to note that ICLR 2025 uses a different rubric than other conferences; for instance, a rating of 8 at ICLR 2025 only indicates a basic "Accept". Furthermore,
- "Accept" is 8 at ICLR 2025 (vs 7 at NeurIPS 2024)
- "Strong Accept" is 10 at ICLR 2025 (vs 9 at NeurIPS 2024)
- ICLR 2025 does not use 7 or 9
We kindly request to take this into account when reviewing our paper.
“The paper is a little bit short on how new problems are generated.”
Thank you for this feedback; we will elaborate further on how new problems are generated in our revised paper. New problems are generated utilizing Python functions and f-strings. We randomly generate variable names and constants (for constant change questions) to input as parameters, and our functions return the strings corresponding to the question statement and solution. We gather these function outputs in our variation dataset.
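As a purely illustrative sketch (the function, problem, and parameter ranges below are hypothetical and not taken from the actual dataset), a variation generator in this style could look like the following:

```python
# Hypothetical sketch of a functional-variation generator using f-strings.
# The problem template, variable pool, and constant range are made up.
import random

def generate_variation(seed=None):
    """Return (problem, solution) strings for one randomly sampled variant."""
    rng = random.Random(seed)
    var = rng.choice(["x", "y", "t", "u", "w"])  # variable change
    c = rng.randint(2, 9)                        # constant change
    problem = (
        f"Find all real {var} such that {var}^2 - {c * c} = 0, and give the "
        f"largest such {var} as your final answer in \\boxed{{}}."
    )
    solution = f"\\boxed{{{c}}}"
    return problem, solution

problem, solution = generate_variation(seed=0)
print(problem)
print(solution)
```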
“How important are the two functional changes that can be made to the problems. That is, variable change and constant change. Intuitively one would say constant change. Have you investigated this?”
On an older version of the dataset, we conducted an analysis of variable change vs. constant change and found that constant changes were significantly more difficult than variable changes. We plan to run similar analyses once we have finished expanding our variation dataset and will report the results when complete.
“When would you consider the benchmark to be solved?”
The community norm for when a benchmark has become saturated is when models can consistently obtain over 90% accuracy. Following this standard, we'll consider our benchmark solved when SOTA models can reliably achieve 90% accuracy.
“How do you generate/sample new problems? And how to you make sure that you sample problems uniformly form the problem space. It might be that all the problems you sample fall into the same region and that the LLM could pick up on the statistics of this once we have train data contamination. This ties into the question on when you would consider the benchmark solved. As it might be possible that a different sampling strategy might again lead to degrading performance.”
Thank you for pointing out this concern. We sampled problems evenly across subject areas, difficulties, and answer types. Here is the breakdown:
'Combinatorics': 58, 'Algebra': 61, 'Geometry': 43, 'Calculus': 75, 'Analysis': 73, 'Number Theory': 95, 'Probability': 21, 'Trigonometry': 17, 'Linear Algebra': 25, 'Complex Numbers': 10, 'Differential Equations': 1. Note that in our classification, we allowed questions to be classified under multiple domains.
Based on Putnam’s own difficulty rating (1, 2 = ‘easy’, 3, 4 = ‘medium’, 5, 6 = ‘hard’), we had 99 easy, 78 medium, and 59 hard. We had 78 modified boxing questions and 158 original boxed questions.
In particular, we have an even distribution between final-answer type questions and proof-based questions, which we represent using modified boxing. Furthermore, the distribution in subject areas reflects the inherent distribution of problems in the original Putnam exam.
“Is it correct to say that is an LLM can reason it will be able to solve to benchmark but the inverse is not true. I.e. if it solves the benchmark it might not perform reasoning.”
LLM performance on our benchmark would not definitively determine whether LLMs can reason, but due to the creativity and mathematical logic needed to solve questions as difficult as Putnam’s, it provides positive evidence of mathematical reasoning capability.
Thank you for recognizing the quality and difficulty of our dataset. Since our previous comment, we have made several substantial improvements that we hope further address your concerns.
Dataset Size and Comprehensiveness: Your concern about dataset size was well-taken. We have significantly expanded Putnam-AXIOM:
- Increased the Original benchmark size from 236 to 522 boxed question-answer pairs by incorporating problems from 1938-1984. The updated breakdown for the dataset follows - ‘Combinatorics': 79, 'Algebra': 193, 'Geometry': 125, 'Calculus': 143, 'Analysis': 177, 'Number Theory': 135, 'Probability': 29, 'Trigonometry': 26, 'Linear Algebra': 41, 'Complex Numbers': 29, 'Differential Equations': 20. Note that in our classification, we allowed questions to be classified under multiple domains. Based on Putnam’s own difficulty rating (1, 2 = ‘easy’, 3, 4 = ‘medium’, 5, 6 = ‘hard’), we had 205 easy, 182 medium, and 135 hard.
- Nearly doubled our functional variations from 53 to 100 (representing a ~30% conversion rate to constant + variable changes of our 522 original questions, with ongoing expansion estimating 150 variations)
- This expansion maintains the benchmark's mathematical rigor while providing more comprehensive coverage
| Model | Variation Accuracy | Corresponding Original Accuracy |
|---|---|---|
| Llama-3.1-8b | 3.85 | 6.41 |
| Llama-3.1-8b Instruct | 6.41 | 12.82 |
| Mistral-7B-v0.1 | 2.56 | 7.69 |
| Mistral-7B-Instruct-v0.3 | 6.41 | 6.41 |
| Gemma-7b | 6.41 | 6.41 |
| Gemma-2b | 2.56 | 5.13 |
| Gemma-7b-it | 5.13 | 7.69 |
| Gemma-2b-it | 1.28 | 2.56 |
| Deepseek-Math-7b-Base | 2.56 | 6.41 |
| Deepseek-Math-7b-Rl | 11.54 | 12.82 |
| Deepseek-Math-7b-Instruct | 8.97 | 11.54 |
| NuminaMath-7B-TIR | 12.82 | 12.82 |
| Qwen2-Math-7B | 11.54 | 15.38 |
| Qwen2-Math-7B-Instruct | 5.13 | 11.54 |
We have further expanded our evaluations across more prompting techniques. In our Putnam Context prompt, we include in the prompt that our problems are from the Putnam Competition.
The following accuracies are on the new dataset with 522 questions:
| Model | Zero Shot | Few Shot | Putnam Context + Few Shot | Few Shot COT | Zero Shot Tree of Thought | Few Shot Tree of Thought |
|---|---|---|---|---|---|---|
| Llama-3.1-8b | 1.53 | 3.25 | 4.78 | 3.25 | 2.87 | 4.02 |
| Llama-3.1-8b Instruct | 6.13 | 5.75 | 5.75 | 5.75 | 8.43 | 5.17 |
| Mistral-7B-v0.1 | 9.67 | 3.64 | 4.41 | 3.83 | 0.57 | 2.68 |
| Mistral-7B-Instruct-v0.3 | 2.11 | 4.02 | 4.02 | 4.21 | 1.91 | 4.59 |
| Gemma-7b | 1.34 | 4.60 | 4.41 | 4.21 | 0.19 | 4.02 |
| Gemma-2b | 0.00 | 2.87 | 4.60 | 3.26 | 0.38 | 3.44 |
| Gemma-7b-it | 4.41 | 3.07 | 3.64 | 3.45 | 1.72 | 3.44 |
| Gemma-2b-it | 0.96 | 0.95 | 0.57 | 1.34 | 1.15 | 0.57 |
| Deepseek-Math-7b-Base | 2.87 | 4.02 | 3.83 | 5.75 | 2.11 | 3.64 |
| Deepseek-Math-7b-Rl | 8.04 | 8.62 | 8.23 | 9.96 | 9.00 | 9.96 |
| Deepseek-Math-7b-Instruct | 6.13 | 6.89 | 6.70 | 6.32 | 7.10 | 8.04 |
| NuminaMath-7B-TIR | 9.38 | 10.34 | 8.43 | 8.24 | 9.38 | 9.77 |
| Qwen2-Math-7B | 8.81 | 9.57 | 10.54 | 9.77 | 6.70 | 7.85 |
| Qwen2-Math-7B-Instruct | 18.2 | 11.49 | 12.26 | 13.4 | 19.92 | 11.88 |
| GPT-4 | - | 11.30 | - | - | - | - |
| GPT-4o | - | 19.35 | - | - | - | - |
The rapid pace of these improvements (all accomplished within one week) demonstrates both the scalability of our approach and our team's commitment to continuous enhancement. We are actively working on further expansions and improvements to make Putnam-AXIOM an even more comprehensive benchmark for mathematical reasoning.
Furthermore, to build upon previous results, we're expanding the evaluations to models with larger parameter sizes and architectures with some preliminary results already (currently only few shot but we plan to have all prompting styles ready for the final version):
- Codestral-22B-v0.1: 4.66
- Qwen2-Math-72B: 11.9
- Qwen2.5-Math-72B: 10.6
- Llama-3.1-70B: 4.24
- Mixtral-8x7B-Instruct-v0.1: 5.93
- Mixtral-8x22B-Instruct-v0.1: 8.47
- Mistral-Small-Instruct-2409: 5.93
Benchmarks are crucial for machine learning research. Existing benchmarks increasingly struggle to sufficiently assess the reasoning capabilities of modern large language models. This is in part due to the impressive advancements that we have seen in the recent past, but also due to the problem of data leakage. Since modern models are trained on huge text corpora that are often scraped from the internet, it is increasingly hard to hold out challenging problems for assessment.
The submission introduces the Putnam-AXIOM benchmark dataset, which consists of 236 challenging mathematical problems from the William Lowell Putnam Mathematical Competition. All questions are modified so that there is a unique numerical answer (concrete number or formula), and the prompt instructs the model to provide the final answer in \boxed{} brackets, which simplifies evaluation (boxed accuracy). Besides providing a new collection of challenging mathematical problems, the authors also identify a set of 52 questions that can be altered in a systematic way so as to lead to new problems with different solutions. The changes include rephrasing, renaming of variables, and changing the value of constants. These functional variants are particularly useful to assess how much of the performance on the original questions is due to memorization. The authors test a multitude of different models, both open source and commercially available, and observe a significant drop in performance when comparing the original Putnam-AXIOM questions with their functional variants. Almost all models show a statistically significant decrease in accuracy, and the best performing model (o1-preview) shows a drop of 30%.
Additionally, the authors evaluate metrics which are capable of assessing the quality of an LLM's answer in more complex scenarios. Since Putnam-AXIOM is still quite challenging for most models, the evaluation was conducted on the simpler MATH dataset. Metrics:
- Teacher forcing: Condition model on ground truth rather than on its own output. Variants: Accuracy, cross entropy, perplexity, bits per character
- ROSCOE: Metrics defined by the ROSCOE suite.
The evaluation exploits the fact that boxed accuracy is available for the benchmark. The authors compute the correlation of each metric with the boxed accuracy in order to evaluate its quality. It turns out that teacher-forced accuracy had the highest correlation with the boxed accuracy.
It should be mentioned that similar approaches to generating variants of benchmarks have already been explored with similar results, which is also clearly stated by the authors. Although I think that the dataset will be a useful addition to the existing portfolio, especially because the problems are really challenging, I feel that it is not quite closing a unique gap. The evaluation of reasoning metrics is a useful addition. However, it is not quite clear to what extent additional rationales given in the ground truth were tested in these experiments. The authors should be a bit more precise on this point. Overall, I can give a weak recommendation for acceptance.
Strengths
- New and challenging mathematical problems.
- 52 questions with functional variants.
- Evaluation of many open source and commercial LLMs.
- Evaluation of proxy metrics.
Weaknesses
- Variant creation is very simple.
Questions
- How is testing for equivalent answers combined with the teacher forced metrics?
- Can you describe the setup for the evaluation of the proxy metrics? Was teacher forcing conducted on the boxed answer or on the entire rationale?
Thank you for your thoughtful review and for highlighting the importance of our work in mathematical reasoning evaluation and benchmark development. For calibration, we'd like to note that ICLR 2025 uses a different rubric than other conferences; for instance, a rating of 8 at ICLR 2025 only indicates a basic "Accept". Furthermore,
- "Accept" is 8 at ICLR 2025 (vs 7 at NeurIPS 2024)
- "Strong Accept" is 10 at ICLR 2025 (vs 9 at NeurIPS 2024)
- ICLR 2025 does not use 7 or 9
We kindly request to take this into account when reviewing our paper.
“Variants creation is very simple.”
Thank you for pointing out this concern; we’ll emphasize this in our revised submission. Our purpose in creating variations for the Putnam questions is to expose gaps in LLM performance––namely, potential data contamination of Putnam problems. Additional complexity in the variations beyond their existing levels would have been unnecessary.
“How is testing for equivalent answers combined with the teacher forced metrics?”
Thank you for pointing out this area of confusion; we will clarify this in our revised paper. Most current benchmarks, including ours, prompt the model to produce boxed answers and then use equivalence functions to compare the boxed answers to the ground-truth final answer. TFA is a proxy metric that does not rely solely on comparing boxed answers for equivalence, but rather considers the model's entire response holistically and evaluates its reasoning. TFA does not use boxed answers or the equivalence function in computing its score for model responses.
“Can you describe the setup for the evaluation of the proxy metrics? Was teacher forcing conducted on the boxed answer or on the entire rationale?”
Thank you for pointing out this potential area of confusion; we will clarify this in our revised paper. As mentioned above, the proxy metrics are run on the entire rationale provided by the model. TFA does not utilize the boxed answers in any way in computing its scores.
Thank you for recognizing the quality and difficulty of our dataset. Since our previous comment, we have made several substantial improvements that we hope further address your concerns.
Dataset Size and Comprehensiveness: Your concern about dataset size was well-taken. We have significantly expanded Putnam-AXIOM:
- Increased the Original benchmark size from 236 to 522 boxed question-answer pairs by incorporating problems from 1938-1984. The updated breakdown for the dataset follows - ‘Combinatorics': 79, 'Algebra': 193, 'Geometry': 125, 'Calculus': 143, 'Analysis': 177, 'Number Theory': 135, 'Probability': 29, 'Trigonometry': 26, 'Linear Algebra': 41, 'Complex Numbers': 29, 'Differential Equations': 20. Note that in our classification, we allowed questions to be classified under multiple domains. Based on Putnam’s own difficulty rating (1, 2 = ‘easy’, 3, 4 = ‘medium’, 5, 6 = ‘hard’), we had 205 easy, 182 medium, and 135 hard.
- Nearly doubled our functional variations from 53 to 100 (representing a ~30% conversion rate to constant + variable changes of our 522 original questions, with ongoing expansion estimating 150 variations)
- This expansion maintains the benchmark's mathematical rigor while providing more comprehensive coverage
| Model | Variation Accuracy | Corresponding Original Accuracy |
|---|---|---|
| Llama-3.1-8b | 3.85 | 6.41 |
| Llama-3.1-8b Instruct | 6.41 | 12.82 |
| Mistral-7B-v0.1 | 2.56 | 7.69 |
| Mistral-7B-Instruct-v0.3 | 6.41 | 6.41 |
| Gemma-7b | 6.41 | 6.41 |
| Gemma-2b | 2.56 | 5.13 |
| Gemma-7b-it | 5.13 | 7.69 |
| Gemma-2b-it | 1.28 | 2.56 |
| Deepseek-Math-7b-Base | 2.56 | 6.41 |
| Deepseek-Math-7b-Rl | 11.54 | 12.82 |
| Deepseek-Math-7b-Instruct | 8.97 | 11.54 |
| NuminaMath-7B-TIR | 12.82 | 12.82 |
| Qwen2-Math-7B | 11.54 | 15.38 |
| Qwen2-Math-7B-Instruct | 5.13 | 11.54 |
We have further expanded our evaluations across more prompting techniques. In our Putnam Context prompt, we include in the prompt that our problems are from the Putnam Competition.
The following accuracies are on the new dataset with 522 questions:
| Model | Zero Shot | Few Shot | Putnam Context + Few Shot | Few Shot COT | Zero Shot Tree of Thought | Few Shot Tree of Thought |
|---|---|---|---|---|---|---|
| Llama-3.1-8b | 1.53 | 3.25 | 4.78 | 3.25 | 2.87 | 4.02 |
| Llama-3.1-8b Instruct | 6.13 | 5.75 | 5.75 | 5.75 | 8.43 | 5.17 |
| Mistral-7B-v0.1 | 9.67 | 3.64 | 4.41 | 3.83 | 0.57 | 2.68 |
| Mistral-7B-Instruct-v0.3 | 2.11 | 4.02 | 4.02 | 4.21 | 1.91 | 4.59 |
| Gemma-7b | 1.34 | 4.60 | 4.41 | 4.21 | 0.19 | 4.02 |
| Gemma-2b | 0.00 | 2.87 | 4.60 | 3.26 | 0.38 | 3.44 |
| Gemma-7b-it | 4.41 | 3.07 | 3.64 | 3.45 | 1.72 | 3.44 |
| Gemma-2b-it | 0.96 | 0.95 | 0.57 | 1.34 | 1.15 | 0.57 |
| Deepseek-Math-7b-Base | 2.87 | 4.02 | 3.83 | 5.75 | 2.11 | 3.64 |
| Deepseek-Math-7b-Rl | 8.04 | 8.62 | 8.23 | 9.96 | 9.00 | 9.96 |
| Deepseek-Math-7b-Instruct | 6.13 | 6.89 | 6.70 | 6.32 | 7.10 | 8.04 |
| NuminaMath-7B-TIR | 9.38 | 10.34 | 8.43 | 8.24 | 9.38 | 9.77 |
| Qwen2-Math-7B | 8.81 | 9.57 | 10.54 | 9.77 | 6.70 | 7.85 |
| Qwen2-Math-7B-Instruct | 18.2 | 11.49 | 12.26 | 13.4 | 19.92 | 11.88 |
| GPT-4 | - | 11.30 | - | - | - | - |
| GPT-4o | - | 19.35 | - | - | - | - |
The rapid pace of these improvements (all accomplished within one week) demonstrates both the scalability of our approach and our team's commitment to continuous enhancement. We are actively working on further expansions and improvements to make Putnam-AXIOM an even more comprehensive benchmark for mathematical reasoning.
Furthermore, to build upon previous results, we're expanding the evaluations to models with larger parameter sizes and architectures with some preliminary results already (currently only few shot but we plan to have all prompting styles ready for the final version):
- Codestral-22B-v0.1: 4.66
- Qwen2-Math-72B: 11.9
- Qwen2.5-Math-72B: 10.6
- Llama-3.1-70B: 4.24
- Mixtral-8x7B-Instruct-v0.1: 5.93
- Mixtral-8x22B-Instruct-v0.1: 8.47
- Mistral-Small-Instruct-2409: 5.93
The paper introduces Putnam-AXIOM, a new dataset to evaluate LLMs’ mathematical reasoning abilities. The dataset is difficult for state-of-the-art models, with the highest accuracy being only around 42%. The paper further aims to address the data contamination problem by introducing functional variations which alter variables, constants, and phrasing of the problems. The dataset allows automatic evaluation of the final solution.
The paper also introduces simple evaluation metrics such as teacher-forced accuracy (TFA), which evaluates intermediate steps by comparing against the ground-truth step-by-step solution. The authors observe that TFA correlates better with final-solution correctness than existing metrics like ROSCOE and BPC, despite TFA’s dependence on the ground truth and its tendency to underrepresent model abilities.
Strengths
- The paper addresses the issue that existing reasoning benchmarks are getting outpaced by the LLMs. The proposed dataset is difficult for state-of-the-art LLMs such as GPT-o1
- The paper addresses the issue of data contamination by following the methods introduced by Srivastava et al. (2024) to generate new data unseen during LLM training. Specifically, the variables, constants, and phrasing of eligible problems are altered. LLM performance drops significantly on the proposed dataset variation.
- The paper introduces an evaluation metric that achieves higher correlation to the box accuracy when tested on easier arithmetic reasoning dataset.
Weaknesses
- According to the paper, the proposed evaluation metrics cannot be used to evaluate proprietary models, which make up a large portion of the state-of-the-art models.
- The LLMs evaluated are relatively small outside of the few state-of-the-art models: other than Claude-3.5 Sonnet, GPT-4, GPT-4o, and GPT-o1, only 7B/8B models are evaluated in Table 1. How do larger models like Llama-70B perform?
- The dataset is relatively small and is not easily extendable, since a lot of manual work is required for new problems, including editing the problem to perform modified boxing (Section 3.1), manually identifying problems suitable for functional variation (Section 3.2), and writing the generation code for each problem (e.g., https://anonymous.4open.science/r/putnam-axiom-B57C/src/variations/putnam2023.py).
Other comments:
- Figures 3 and 11 need to be better presented; the labels are hard to read.
Questions
- I am confused why the paper did not present TFA scores for SOTA models in Table 1. The paper justifies this by stating that “we can’t evaluate TFA on proprietary models as we require the log probabilities of the input tokens.” However, the TFA score introduced in Section 3.4 seems to only require the ground-truth data and the predicted tokens.
- How is correlation calculated for evaluating the TFA metric?
- What LLM model is used to produce Table 2 to evaluate the metric? Does TFA outperform other metrics using different-sized models?
- The paper claims that the generated variations of the dataset have the same level of difficulty as the original. However, won’t perturbing the constants sometimes make a problem more difficult? Would it still be a fair comparison if each LLM is evaluated on different variations?
Thank you for your thoughtful review and for highlighting the importance of our work in mathematical reasoning evaluation and benchmark development. For calibration, ICLR 2025 uses a different rubric than other conferences:
- "Accept" is 8 at ICLR 2025 (vs 7 at NeurIPS 2024)
- "Strong Accept" is 10 at ICLR 2025 (vs 9 at NeurIPS 2024)
- ICLR 2025 does not use 7 or 9
We kindly request to take this into account when reviewing our paper.
“According to the paper, the proposed evaluation metrics cannot be used to evaluate proprietary models ...”
Thank you for pointing this out; we acknowledge this limitation and invite future work to design new and better proxy metrics. We will clarify this point in our revised paper. For proprietary models, it is not always possible to run TFA. For instance, with the API for GPT-4, GPT-4o, and o1, it is impossible to prefill the “assistant message,” which is a crucial step for the TFA metric. This mirrors Huang et al. [1], where BPC calculation requires internal token probabilities unavailable through closed APIs, reflecting a common challenge in similar research. However, our ongoing TFA experiments with Claude 3.5 Sonnet show promising results via subsampling, which we'll include in our revision.
[1] Huang, Y., Zhang, J., Shan, Z., & He, J. (2024). Compression represents intelligence linearly. arXiv preprint arXiv:2404.09937.
“The LLMs evaluated are relatively small outside of the few state-of-the-art models ... How does larger models like llama-70b perform?”
Updated numbers:
- Codestral-22B-v0.1: 4.66
- Qwen2-Math-72B: 11.9
- Qwen2.5-Math-72B: 10.6
- Llama-3.1-70B: 4.24
- Mixtral-8x7B-Instruct-v0.1: 5.93
- Mixtral-8x22B-Instruct-v0.1: 8.47
- Mistral-Small-Instruct-2409: 5.93
“The dataset is relatively small, and is not easily extendable since a lot of manual work is required for new problems ...”
Thank you for pointing out this potential issue with the size of our dataset. During this rebuttal period, we have more than doubled the size of the Putnam-AXIOM Original dataset. We also plan on creating more variations for the new questions we’ve added. Unfortunately, manually editing and creating problems may be the best currently-available way to generate consistent, high-quality problems for a dataset. Due to the limited reliability of automated methods like LLMs, automatically generated questions would still require manual verification––also a time-consuming process.
“I am confused why the paper did not present TFA scores for SOTA models in table 1 ...”
Thank you for pointing out this issue with the TFA table sections. As we mentioned above, it is not always possible to run TFA for closed-source models. For open-source models we can efficiently compute teacher forcing on an entire string with a single forward pass, similar to how these models are trained. However, frontier model APIs only let us generate tokens autoregressively, so each new token is conditioned not on the ground-truth solution, as in TFA, but on the previously generated tokens. For instance, with the API for GPT-4, GPT-4o, and o1, it is impossible to prefill the “assistant message,” which is a crucial step for the TFA metric, though we are optimistic about Claude 3.5 using a subsampling approximation.
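To make the single-forward-pass idea concrete, here is a minimal sketch of teacher-forced accuracy for an open-source model using Hugging Face transformers; the model id is only an example, and the handling of the prompt/solution boundary is simplified relative to what a careful implementation (or the paper's own code) would do:

```python
# Sketch of teacher-forced accuracy (TFA): one forward pass over
# prompt + ground-truth solution, then compare each argmax prediction with the
# next ground-truth token, scoring only the solution tokens. Illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/deepseek-math-7b-base"  # example open-source model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def teacher_forced_accuracy(prompt: str, solution: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + solution, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits         # (1, seq_len, vocab_size)
    preds = logits[:, :-1, :].argmax(dim=-1)     # prediction for token i+1
    targets = full_ids[:, 1:]                    # actual token i+1
    start = prompt_ids.shape[1] - 1              # first solution-token position
    correct = (preds[:, start:] == targets[:, start:]).float()
    return correct.mean().item()
```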
“How is correlation calculated for evaluating the TFA metric?”
Thank you for pointing this out. We collected a list of 20 models. We first compute each model's boxed accuracy on MATH, then compute the proxy metric for each model on MATH, and finally compute the correlation between model accuracy and the value of the proxy metric across models (i.e., the correlation over 20 data points). We will clarify this in our revision.
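For clarity, a minimal sketch of this computation is below; the numbers are placeholders, and the choice of Pearson correlation is our assumption for illustration:

```python
# Sketch of the correlation computation: one (boxed accuracy, proxy metric)
# pair per model, then a single correlation over those points.
# Values below are placeholders, not real results.
import numpy as np
from scipy.stats import pearsonr

boxed_accuracy = np.array([0.12, 0.25, 0.31, 0.44, 0.52])  # per-model MATH accuracy
proxy_metric   = np.array([0.35, 0.41, 0.48, 0.57, 0.66])  # per-model TFA (or other proxy)

r, p_value = pearsonr(boxed_accuracy, proxy_metric)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```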
“What LLM model is used to produce table 2 to evaluate the metric? Does TFA out perform other metrics using different sized models?”
Thank you for pointing this out. We will make this more clear in our revisions. To create Table 2, we used 20 diverse models ranging from 7B to 72B parameters, including Meta-Llama-3-70B, Mixtral series, Llama-2 variants, Qwen models, deepseek, EleutherAI's llemma, and Google's Gemma models.
“The paper claims that the generated variations of the dataset has the same level of difficulty as the original. However, won’t perturbing the constants can sometimes make a problem more difficult? Would it still be a fair comparison if each LLM is evaluated on different variations?”
Thank you for pointing out this issue with potential difficulty inconsistency. We intentionally designed our variation dataset to maintain consistent difficulty through two design principles:
- Variable changes simply rename variables (e.g., 'x+5=6' to 'y+5=6') without affecting complexity.
- Constant changes are limited to problems requiring the solver to derive general-form solutions or those with predictable relationships (like linear scaling) between question constants and solutions
These constraints ensure that variations remain at the same difficulty level as the original problems.
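As a purely hypothetical illustration (this problem is not from the dataset), the following shows a constant change whose answer tracks the changed constant predictably, so the required reasoning is unchanged:

```latex
% Hypothetical example of a constant change (not an actual dataset problem).
\textbf{Original:} Find the sum of the roots of $x^2 - 7x + 10 = 0$.
Answer: $\boxed{7}$ (by Vieta's formulas).

\textbf{Constant change:} Find the sum of the roots of $x^2 - 11x + 10 = 0$.
Answer: $\boxed{11}$; the answer scales linearly with the changed constant,
so the difficulty is the same.
```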
I thank the authors for their detailed response. I have adjusted my score accordingly. I still note that the dataset is difficult to expand without extensive manual work, though I acknowledge the challenging nature of generating high-quality problems as the authors pointed out.
Thank you for recognizing the quality and difficulty of our dataset. Since our previous comment, we have made several substantial improvements that we hope further address your concerns.
Dataset Size and Comprehensiveness: Your concern about dataset size was well-taken. We have significantly expanded Putnam-AXIOM:
- Increased the Original benchmark size from 236 to 522 boxed question-answer pairs by incorporating problems from 1938-1984. The updated breakdown for the dataset follows - ‘Combinatorics': 79, 'Algebra': 193, 'Geometry': 125, 'Calculus': 143, 'Analysis': 177, 'Number Theory': 135, 'Probability': 29, 'Trigonometry': 26, 'Linear Algebra': 41, 'Complex Numbers': 29, 'Differential Equations': 20. Note that in our classification, we allowed questions to be classified under multiple domains. Based on Putnam’s own difficulty rating (1, 2 = ‘easy’, 3, 4 = ‘medium’, 5, 6 = ‘hard’), we had 205 easy, 182 medium, and 135 hard.
- Nearly doubled our functional variations from 53 to 100 (representing a ~30% conversion rate to constant + variable changes of our 522 original questions, with ongoing expansion estimating 150 variations)
- This expansion maintains the benchmark's mathematical rigor while providing more comprehensive coverage
| Model | Variation Accuracy | Corresponding Original Accuracy |
|---|---|---|
| Llama-3.1-8b | 3.85 | 6.41 |
| Llama-3.1-8b Instruct | 6.41 | 12.82 |
| Mistral-7B-v0.1 | 2.56 | 7.69 |
| Mistral-7B-Instruct-v0.3 | 6.41 | 6.41 |
| Gemma-7b | 6.41 | 6.41 |
| Gemma-2b | 2.56 | 5.13 |
| Gemma-7b-it | 5.13 | 7.69 |
| Gemma-2b-it | 1.28 | 2.56 |
| Deepseek-Math-7b-Base | 2.56 | 6.41 |
| Deepseek-Math-7b-Rl | 11.54 | 12.82 |
| Deepseek-Math-7b-Instruct | 8.97 | 11.54 |
| NuminaMath-7B-TIR | 12.82 | 12.82 |
| Qwen2-Math-7B | 11.54 | 15.38 |
| Qwen2-Math-7B-Instruct | 5.13 | 11.54 |
We have further expanded our evaluations across more prompting techniques. In our Putnam Context prompt, we include in the prompt that our problems are from the Putnam Competition.
The following accuracies are on the new dataset with 522 questions:
| Model | Zero Shot | Few Shot | Putnam Context + Few Shot | Few Shot COT | Zero Shot Tree of Thought | Few Shot Tree of Thought |
|---|---|---|---|---|---|---|
| Llama-3.1-8b | 1.53 | 3.25 | 4.78 | 3.25 | 2.87 | 4.02 |
| Llama-3.1-8b Instruct | 6.13 | 5.75 | 5.75 | 5.75 | 8.43 | 5.17 |
| Mistral-7B-v0.1 | 9.67 | 3.64 | 4.41 | 3.83 | 0.57 | 2.68 |
| Mistral-7B-Instruct-v0.3 | 2.11 | 4.02 | 4.02 | 4.21 | 1.91 | 4.59 |
| Gemma-7b | 1.34 | 4.60 | 4.41 | 4.21 | 0.19 | 4.02 |
| Gemma-2b | 0.00 | 2.87 | 4.60 | 3.26 | 0.38 | 3.44 |
| Gemma-7b-it | 4.41 | 3.07 | 3.64 | 3.45 | 1.72 | 3.44 |
| Gemma-2b-it | 0.96 | 0.95 | 0.57 | 1.34 | 1.15 | 0.57 |
| Deepseek-Math-7b-Base | 2.87 | 4.02 | 3.83 | 5.75 | 2.11 | 3.64 |
| Deepseek-Math-7b-Rl | 8.04 | 8.62 | 8.23 | 9.96 | 9.00 | 9.96 |
| Deepseek-Math-7b-Instruct | 6.13 | 6.89 | 6.70 | 6.32 | 7.10 | 8.04 |
| NuminaMath-7B-TIR | 9.38 | 10.34 | 8.43 | 8.24 | 9.38 | 9.77 |
| Qwen2-Math-7B | 8.81 | 9.57 | 10.54 | 9.77 | 6.70 | 7.85 |
| Qwen2-Math-7B-Instruct | 18.2 | 11.49 | 12.26 | 13.4 | 19.92 | 11.88 |
| GPT-4 | - | 11.30 | - | - | - | - |
| GPT-4o | - | 19.35 | - | - | - | - |
The rapid pace of these improvements (all accomplished within one week) demonstrates both the scalability of our approach and our team's commitment to continuous enhancement. We are actively working on further expansions and improvements to make Putnam-AXIOM an even more comprehensive benchmark for mathematical reasoning.
The paper introduces Putnam-AXIOM, a benchmark of 236 problems from the William Lowell Putnam Mathematical Competition, for evaluating the reasoning capabilities of Large Language Models (LLMs). It allows for automated evaluation with an equivalence function that handles different mathematical presentations of the same final answer, which must be provided inside "\boxed{}". The paper shows state-of-the-art LLMs perform poorly on Putnam-AXIOM. The authors address the potential data contamination issue with Putnam-AXIOM Variations, created by altering the variable names, constant values, or phrasing of 52 selected problems from the original Putnam-AXIOM, and show that the performance of LLMs on this dataset is even worse.
Strengths
The benchmark addresses the limitations of LLM reasoning, such as the inability to solve truly unseen problems, which indicates that the models are probably memorizing solution patterns from training data instead of demonstrating true understanding and reasoning. This is shown in the paper by the low performance on this novel benchmark (Original) and by the even worse performance when altering variables, constants, and question phrasing while maintaining the same logical reasoning required to solve the question (Variations). The paper shows it is a useful benchmark to evaluate LLMs' reasoning capabilities since current benchmarks are already saturated. It also addresses the data contamination issue with the Putnam-AXIOM Variations.
Weaknesses
The ROSCOE metrics are not applied to the evaluation on Putnam-AXIOM, and the boxed-answer requirement discards intermediate steps from the performance metric, so it does not address LLMs' known logical mistakes and leaps in the reasoning process, even when they reach the right answer, despite this being a problem identified by the authors at the beginning of the paper. The proposed TFA proxy metric is not evaluated on proprietary models, so the state-of-the-art models are left out of the most comprehensive evaluation.
The 52 variations can give rise to an unlimited set of problems, but they were created manually, and their underlying logic can eventually be data contaminated as well, which is not addressed. The dataset is also too small.
Questions
Is there a way of automating the creation of new Putnam-AXIOM variations in order to avoid logic data contamination? Do you agree logic data contamination can happen? If not, why not?
Thank you for your thoughtful review and for highlighting the importance of our work in mathematical reasoning evaluation and benchmark development. For calibration, we'd like to note that ICLR 2025 uses a different rubric than other conferences; for instance, a rating of 8 at ICLR 2025 only indicates a basic "Accept". Furthermore,
- "Accept" is 8 at ICLR 2025 (vs 7 at NeurIPS 2024)
- "Strong Accept" is 10 at ICLR 2025 (vs 9 at NeurIPS 2024)
- ICLR 2025 does not use 7 or 9
We kindly request to take this into account when reviewing our paper.
“ROSCOE metrics are not applied to the evaluation in Putnam-AXIOM and the boxed answer requirement discards intermediate steps to be used in the performance metric, so it doesn’t address the known LLM’s logical mistakes and leaps in the reasoning process, even when getting the right answer, despite being a problem identified by the authors in the beginning of the paper. The proposed TFA proxy metric is not evaluated on proprietary models so the state-of-the-art models are left out of the most comprehensive evaluation.”
Thank you for pointing this out; we will clarify this point in our revised paper. We recognize the limitations of boxed answers, namely that they do not address logical leaps and errors in a model's response when it still reaches the correct answer. Additionally, many problems do not have boxable answers (e.g., proofs or complex formulae), which limits the scope of current benchmarks. That is the main unique strength of using ROSCOE, TFA, and other proxy metrics for reasoning. In particular, TFA and our other proxy metrics are applied to the entire reasoning trace, not just the boxed answer.
We chose the MATH dataset to run our proxy metric evaluations because it yields a wide range of model performances (no model performs highly on Putnam-AXIOM). Additionally, we found a high degree of alignment between correct reasoning and correct final answers on this dataset (when examining a random sample, only 3 out of 75 correct final answers lacked correct reasoning). Thus we believe it is reasonable to use boxed accuracy on MATH as a substitute for correct reasoning traces. We found that the ROSCOE metrics had poor correlation with boxed accuracy compared to TFA and our other proxy metrics on MATH, so we did not include them for Putnam-AXIOM.
For proprietary models, it is not always possible to run TFA. For instance, with the API for GPT-4, GPT-4o, and o1, it is impossible to prefill the “assistant message,” which is a crucial step for the TFA metric. This limitation mirrors challenges faced in similar research, such as Huang et al. [1], where calculating Bits Per Character (BPC) requires access to internal token probabilities inaccessible through closed model APIs. However, we are currently experimenting with TFA for Claude 3.5 Sonnet and see promising results with a subsampling approximation. We will include these in our revised submission.
[1] Huang, Y., Zhang, J., Shan, Z., & He, J. (2024). Compression represents intelligence linearly. arXiv preprint arXiv:2404.09937.
“The 52 variations can give rise to an unlimited set of problems but they were created manually, and their logic can eventually be data contaminated as well which is not addressed. Also it is a too small dataset.”
Thank you for the feedback on this potential issue with our dataset size and contamination. We've already doubled the size of Putnam-AXIOM Original, have 20 new variations created, and are planning to add more to both datasets. Further, our dataset size now aligns with other reasoning benchmarks (JEEBench: 515, ARB: 105, SciBench: 692), while being significantly more challenging (GPT-4 performance):
- Putnam-AXIOM: 9.32%
- SciBench math: 44.52%
- JEEBench math: 28%
- ARB symbolic math: 18%
“Is there a way of automating the creation of new Putnam-AXIOM variations in order to avoid Logic data contamination? Do you agree Logic data contamination can happen, if not, why not?”
Thank you for pointing out this potential issue with logic data contamination and for your suggestion on automating variation creation. We are currently experimenting with fine-tuning models on the 53 original questions corresponding to our variations and then evaluating the models on the variations. The results of this experiment will determine the significance of logic contamination. We also examined the possibility of using LLMs to automate the creation of variations by prompting GPT-4o to perform variable changes on several Putnam questions. While it succeeded with a small number of questions, we found that GPT-4o often failed to keep the mathematical content of the variable changes the same, altering constants or equations. Given this unreliability, the variations would have to be human-verified and corrected anyway, which would likely take as much time as, if not more than, selecting and creating variations manually.
Thank you for your concerns about dataset size. Since our last update, we have completed the expansion of Putnam-AXIOM Original and significantly expanded Putnam-AXIOM Variations:
- Increased the Original benchmark size from 236 to 522 boxed question-answer pairs by incorporating problems from 1938-1984 (now covering all Putnam problems since its first offering). The updated breakdown for the dataset follows - ‘Combinatorics': 79, 'Algebra': 193, 'Geometry': 125, 'Calculus': 143, 'Analysis': 177, 'Number Theory': 135, 'Probability': 29, 'Trigonometry': 26, 'Linear Algebra': 41, 'Complex Numbers': 29, 'Differential Equations': 20. Note that in our classification, we allowed questions to be classified under multiple domains. Based on Putnam’s own difficulty rating (1, 2 = ‘easy’, 3, 4 = ‘medium’, 5, 6 = ‘hard’), we had 205 easy, 182 medium, and 135 hard.
- Nearly doubled our functional variations from 53 to 100 (representing a ~30% conversion rate to constant + variable changes of our 522 original questions, with ongoing expansion estimating ~150 variations)
- This expansion maintains the benchmark's mathematical rigor while providing more comprehensive coverage, as our updated evaluation results show:
| Model | 236 Dataset Accuracy | 522 Dataset Accuracy |
|---|---|---|
| Llama-3.1-8b | 6.84 | 3.25 |
| Llama-3.1-8b Instruct | 10.27 | 5.75 |
| Mistral-7B-v0.1 | 6.08 | 3.64 |
| Mistral-7B-Instruct-v0.3 | 6.84 | 4.02 |
| Gemma-7b | 6.84 | 4.60 |
| Gemma-2b | 7.22 | 2.87 |
| Gemma-7b-it | 6.84 | 3.07 |
| Gemma-2b-it | 3.04 | 0.95 |
| Deepseek-Math-7b-Base | 7.60 | 4.02 |
| Deepseek-Math-7b-Rl | 11.02 | 8.62 |
| Deepseek-Math-7b-Instruct | 7.60 | 6.89 |
| NuminaMath-7B-TIR | 11.88 | 10.34 |
| Qwen2-Math-7B | 9.50 | 9.57 |
| Qwen2-Math-7B-Instruct | 12.55 | 11.49 |
| GPT-4 | 7.98 | 11.3 |
| GPT-4o | 17.8 | 19.35 |
We also ran updated evaluations on the new variation dataset and the corresponding original questions.
| Model | Variation Accuracy | Corresponding Original Accuracy |
|---|---|---|
| Llama-3.1-8b | 3.85 | 6.41 |
| Llama-3.1-8b Instruct | 6.41 | 12.82 |
| Mistral-7B-v0.1 | 2.56 | 7.69 |
| Mistral-7B-Instruct-v0.3 | 6.41 | 6.41 |
| Gemma-7b | 6.41 | 6.41 |
| Gemma-2b | 2.56 | 5.13 |
| Gemma-7b-it | 5.13 | 7.69 |
| Gemma-2b-it | 1.28 | 2.56 |
| Deepseek-Math-7b-Base | 2.56 | 6.41 |
| Deepseek-Math-7b-Rl | 11.54 | 12.82 |
| Deepseek-Math-7b-Instruct | 8.97 | 11.54 |
| NuminaMath-7B-TIR | 12.82 | 12.82 |
| Qwen2-Math-7B | 11.54 | 15.38 |
| Qwen2-Math-7B-Instruct | 5.13 | 11.54 |
This paper introduces Putnam-AXIOM, a benchmark curated from the William Lowell Putnam Mathematical Competition, which originally consists of 236 problems (increased to 522 during the rebuttal). Among these problems, 52 (increased to 100 during the rebuttal) have functional variations (i.e., changing variables or constants), permitting an unlimited number of variations. The evaluation shows that state-of-the-art LLMs like o1-preview achieve less than 42% accuracy, while most LLMs (e.g., GPT-4o, Claude-3.5) achieve less than 20% accuracy. Putnam-AXIOM is particularly meaningful given that existing reasoning datasets quickly get saturated due to the rapid improvement of capable LLMs.
Static problems are relatively less meaningful given the increasing data contamination concern. Thus, the main concern raised by reviewers is whether 52 (now 100) problems with flexible variations are sufficient for a meaningful benchmark for assessing LLMs' reasoning capability. To make the benchmark much stronger, it is recommended to have formalized answers rather than boxed solutions in text.
Additional Comments on Reviewer Discussion
During the rebuttal, the authors noted that the rating rubric of ICLR'25 differs from other ML conferences and shared new additions to the Putnam-AXIOM benchmark as well as updated evaluations. The new results suggest Putnam-AXIOM is generally more challenging than many existing benchmarks such as SciBench, JEEBench, and ARB symbolic math. Unfortunately, reviewers did not reach a general consensus (one strong support, one borderline accept, and three borderline rejects).
Reject