4.9/10 · Poster · 4 reviewers
Reviewer scores: 4, 2, 2, 3 (min 2, max 4, std 0.8)
ICML 2025

Leveraging Online Olympiad-Level Math Problems for LLMs Training and Contamination-Resistant Evaluation

Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We propose a new dataset and benchmark for training and evaluating mathematical reasoning in LLMs.

Abstract

Keywords
Mathematical Reasoning · Large Language Models · Evaluation

Reviews and Discussion

Review (Rating: 4)

This paper created a new math dataset sourced from the Art of Problem Solving. It designed a pipeline that includes (1) raw data collection, (2) math question detection, (3) question-answer extraction, (4) solution rewriting, and (5) data decontamination. Steps 1 to 3 involve the use of LLMs in the processing. To provide an unbiased benchmark for evaluating the performance of LLMs on math, this paper collected discussions between January 2023 and September 2024 and processed them through a more complex procedure to ensure benchmark quality. The authors further used this dataset to evaluate the performance of open-source models. There is a clear decrease in performance on questions from more recent discussions.
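As a rough illustration of this five-stage pipeline, a minimal sketch is given below; all function and field names are hypothetical placeholders for the LLM-based components, not the authors' actual implementation.

```python
# Hypothetical sketch of the five-stage pipeline summarized above; the callables
# passed in stand for the LLM-based components and are not the authors' code.
from typing import Callable, Iterable

def build_dataset(
    topics: Iterable[dict],                      # (1) raw forum topics already collected
    is_math: Callable[[str], bool],              # (2) LLM-based math question detection
    extract_qa: Callable[[list], list],          # (3) LLM-based question-answer extraction
    rewrite: Callable[[str, str], str],          # (4) LLM-based step-by-step solution rewriting
    is_contaminated: Callable[[str], bool],      # (5) decontamination check (e.g., n-gram overlap)
) -> list[dict]:
    dataset = []
    for topic in topics:
        if not is_math(topic["posts"][0]):       # skip non-math topics
            continue
        for question, answer in extract_qa(topic["posts"]):
            if is_contaminated(question):        # drop questions overlapping known benchmarks
                continue
            dataset.append({
                "question": question,
                "solution": rewrite(question, answer),
                "timestamp": topic["timestamp"],  # kept for time-based evaluation splits
            })
    return dataset
```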

update after rebuttal

I maintain my view that this paper is suitable for acceptance.

Questions for Authors

  1. For questions with 2 or 3 answers, how do you handle them?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

N/A.

Experimental Design and Analyses

The design of the experiments is well-structured. The first part of the evaluation focuses on data quality. The authors randomly select 10% of the benchmark questions and have a group of graduate students evaluate their correctness. Additionally, they compare the correctness of models on the Olympiad benchmark and LiveAoPSBench, observing a clear positive correlation. An ablation study also confirms the necessity of solution rewriting.

The second part examines the effectiveness of the math instruction fine-tuning dataset. The results indicate that AoPS-Ins improves the model more effectively than Numina.

The third part evaluates the performance of open-source models on LiveAoPSBench. Dividing topics by time is a clever approach to avoiding contamination. The observed performance decline further indicates that benchmarking still suffers from data contamination to some extent.

Supplementary Material

No.

Relation to Broader Scientific Literature

This paper provides a new Math Instruction Fine-tuning dataset and a math benchmark. This effort contributes to the community of AI for math.

Essential References Not Discussed

N/A.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

The short title at the top of each page exceeds the limit.

Author Response

We thank the reviewer for their positive comments. Below, we answer the main questions raised by the reviewer:

Q1: For questions with 2 or 3 answers, how do you handle them?

A1: For the LiveAoPSBench evaluation set, we only retain questions with closed-form answers. If a solution provides multiple valid answers, they are summarized as a list of closed-form answers. After this, we remove all questions that show discrepancies (i.e., different closed-form answers) between the community-provided solutions (see Section 3.2, LLM cross-checking).
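A minimal sketch of such a discrepancy filter is shown below; the field names and data layout are assumptions for illustration, not the paper's actual code.

```python
# Hypothetical sketch: keep only evaluation questions whose community solutions
# agree on the same set of closed-form answers; field names are illustrative.
def filter_consistent_questions(questions: list[dict]) -> list[dict]:
    kept = []
    for q in questions:
        # Each solution is assumed to carry its extracted closed-form answers.
        answer_sets = {frozenset(sol["final_answers"]) for sol in q["solutions"]}
        if len(answer_sets) == 1:   # all community solutions agree, so keep the question
            kept.append(q)
    return kept
```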

In the AoPS-Instruct training set, we include all recognized solutions based on forum discussions for two reasons: (1) Most of the Q-A pairs in the training set either have no final answer (e.g., proof-based questions) or have a free-form answer, making it hard to rely on voting mechanisms to resolve discrepancies. (2) Prior works such as AlphaCode [1] have successfully used partially correct solutions for training their models. Therefore, we choose not to apply majority-voting filters, allowing users to customize the filtering according to their needs.

[1] Li et al., Competition-Level Code Generation with AlphaCode, 2022

Review (Rating: 2)

The paper introduces a scalable pipeline that leverages the Art of Problem Solving (AoPS) forum to construct two key resources for advancing Olympiad-level mathematical reasoning with LLMs. (1) AoPS-Instruct is a large-scale instruction-tuning dataset containing over 600k QA pairs, extracted and rewritten from AoPS forum posts. (2) LiveAoPSBench is a continuously evolving, timestamped evaluation set intended to minimize overlap (contamination) with existing training corpora. Through extensive experiments, the authors demonstrate that (1) fine-tuning on AoPS-Instruct improves model performance on various math benchmarks, including Olympiad-level problems, and (2) older test sets often lead to inflated accuracy due to potential data contamination.

Questions for Authors

  1. Would you explore evaluating the results on larger-scale LLMs (7B+ parameters)?
  2. Have you considered adapting your pipeline to handle open-ended proof-based questions with no single “boxed” answer? What modifications might be needed for verifying correctness in such contexts?

Claims and Evidence

The authors claim that “Fine-tuning various LLMs on AoPS-Instruct lead to improved performance on standard benchmarks such as OlympiadBench, Omni-Math, and our LiveAoPSBench dataset, verifying the effectiveness of our dataset in enhancing math reasoning capabilities of LLMs.” However, their evaluation is limited to the LLMs with fewer than 7B parameters. To strengthen their claims, the authors should extend their experiments to larger-scale LLMs (exceeding 7B parameters) and provide a more comprehensive assessment of the proposed method’s effectiveness.

Methods and Evaluation Criteria

The evaluation criteria are somewhat limited because no larger-scale LLMs (exceeding 7B parameters) are used.

Theoretical Claims

This paper does not include theoretical claims.

Experimental Design and Analyses

The experimental analysis is valid. However, expanding the experiments to include a broader range of LLMs with more than 7B parameters would further strengthen the findings.

Supplementary Material

I briefly reviewed the supplementary materials, which primarily contain more details on the proposed benchmarks.

Relation to Broader Scientific Literature

This paper is not connected to the broader scientific literature.

Essential References Not Discussed

Regarding mathematical benchmarks for LLMs, the paper covers the necessary related work.

Other Strengths and Weaknesses

Strengths:

  1. The authors present a fully automated procedure—from extracting forum posts, filtering and rewriting answers, to building a “live” test set—showcasing a viable way to harness community-generated data for advanced math tasks.
  2. By time-stamping problems and drawing from posts beyond a given cutoff date, LiveAoPSBench offers a valuable approach to reduce dataset leakage, giving a more faithful measure of actual LLM reasoning ability.
  3. The authors provide a thorough set of experiments across multiple math benchmarks (Omni-Math, OlympiadBench, etc.) and demonstrate that training on AoPS-Instruct consistently boosts performance. They also show a strong correlation between LiveAoPSBench and a human-verified Olympiad dataset, indicating good dataset quality.

Weaknesses:

The paper is well-structured and effectively conveys its main points. Improving mathematical reasoning capabilities of LLMs is an important problem in the research field, and the authors propose the corresponding method to address this problem. However, the experimental validation presented in this paper is weak.

  1. The evaluation is limited to the LLMs with fewer than 7B parameters. To strengthen their claims, the authors should extend their experiments to larger-scale LLMs (exceeding 7B parameters) and provide a more comprehensive assessment of the proposed method’s effectiveness.
  2. The evaluation set primarily consists of problems with concrete numeric or symbolic final answers. This excludes a large portion of Olympiad problems that require more open-ended proofs, which the pipeline and automated evaluations currently do not fully capture.

Other Comments or Suggestions

Figure 1 contains too many plotted lines, which may impact visual clarity and overall readability.

Author Response

We thank the reviewer for their constructive feedback. Below, we answer the main concerns raised by the reviewer:

Q1: The evaluation is limited to the LLMs with fewer than 7B parameters. To strengthen their claims, the authors should extend their experiments to larger-scale LLMs (exceeding 7B parameters) and provide a more comprehensive assessment of the proposed method’s effectiveness.

A1: We do not currently have the resources to fine-tune models larger than 7B, but we believe we provide substantial evidence that we hope motivates the community to push this further. To reiterate some of the evidence shown in the paper:

  • Table 2 presents results for two 7B math-specialized models, alongside Llama 1B and 3B models, all consistently demonstrating improved accuracy across the board.
  • Table 6 in the appendix shows that performance gains transfer from a smaller 1.5B rewriting model to a larger 7B model.
  • For evaluation, we tested models up to 72B on our LiveAoPSBench, highlighting the benchmark’s difficulty and its correlation with uncontaminated benchmarks (Figure 5a).

Q2: The evaluation set primarily consists of problems with concrete numeric or symbolic final answers. This excludes a large portion of Olympiad problems that require more open-ended proofs, which the pipeline and automated evaluations currently do not fully capture.

A2: Evaluating proof-based questions has remained a long-standing and persistent challenge in the mathematical reasoning community, with no widely accepted solution outside formal languages like Lean—which fall beyond the scope of our work. A key limitation is the lack of large-scale training data for informal proof-based problems. As noted in our paper, approximately 30% of our training dataset consists of proof-based questions. We are optimistic that this dataset will support the development of new methods for evaluating proof-based reasoning.

Review (Rating: 2)
  1. The paper constructs a large-scale and diverse dataset for Olympiad-level mathematical reasoning, which is significant for the development of LLMs in mathematical problem-solving.
  2. Experiments demonstrate that the dataset can effectively improve LLM performance on benchmarks like MATH. The dataset can also serve as a benchmark with some resistance to contamination, effectively evaluating model performance.

Questions for Authors

  1. NuminaMath also sources data from AoPS and rewrites solutions using GPT-4o. How does your approach compare to theirs, and what advantages does it offer?

Claims and Evidence

Yes

Methods and Evaluation Criteria

  1. Insufficient methodological details: 1.1) No conflict resolution protocol for handling discrepancies in community-provided answers (e.g., voting mechanisms). 1.2) Lack of validation for each step in the dataset construction pipeline (e.g., whether LLM cross-checking effectively removes incorrect answers).

  2. The current approach detects duplicate data via substring matching but fails to identify semantically equivalent problems with different wording or multilingual translation issues (e.g., Chinese-to-English translations). Could semantic similarity be used instead?

Theoretical Claims

N/A

Experimental Design and Analyses

  1. The paper does not effectively demonstrate that timestamps mitigate contamination. There is no quantitative analysis of new problem duplication rates, such as statistics on overlap between different time periods and previous datasets.

Supplementary Material

No.

Relation to Broader Scientific Literature

Similar work (such as the NuminaMath dataset) has been done on this topic. The contribution of this paper is marginal.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

Constructing a large-scale and high-quality dataset for mathematical reasoning is an important task and will promote the research on mathematical reasoning. However, the contribution of this paper is marginal. It is somewhat weak technically.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their constructive feedback. Below, we answer the main concerns raised by the reviewer:

Q1: No conflict resolution protocol for handling discrepancies in community-provided answers (e.g., voting mechanisms).

A1: For the LiveAoPSBench evaluation set, we only keep questions that have a closed-form answer, and we handle discrepancies between community-provided answers via LLM cross-checking.

In the AoPS-Instruct training set, we include all recognized solutions based on forum discussions for two reasons: (1) Most of the Q-A pairs in the training set either have no final answer (e.g., proof-based questions) or have a free-form answer, making it hard to rely on voting mechanisms to resolve discrepancies. (2) Prior works such as AlphaCode [1] have successfully used partially correct solutions for training their models. Therefore, we choose not to apply majority-voting filters, allowing users to customize the filtering according to their needs.

[1] Li et al., Competition-Level Code Generation with AlphaCode, 2022

Q2: Lack of validation for each step in the dataset construction pipeline (e.g., whether LLM cross-checking effectively removes incorrect answers).

A2: In Figure 5b, we show the effectiveness of LLM rewriting. The effectiveness of LLM cross-checking has been qualitatively verified by human annotators: in Section 4.4, a human-annotated error rate of 8% was reported, whereas we empirically observed an error rate of over 20% before applying LLM cross-checking.

Q3: The current approach detects duplicate data via substring matching but fails to identify semantically equivalent problems with different wording or multilingual translation issues (e.g., Chinese-to-English translations). Could semantic similarity be used instead?

A3: N-gram decontamination is standard practice for math data. We follow the DeepSeek-Math paper [2], which uses 10-gram decontamination; Qwen-Math also applies decontamination based on 13-grams [3]. [4] leverages an LLM for semantic-based decontamination, but this approach is not widely adopted due to its inefficiency at scale and high false-positive rate.
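For concreteness, a minimal sketch of an n-gram decontamination check of this kind is shown below; it is a simplified illustration, not the exact DeepSeek-Math or Qwen-Math implementation.

```python
# Simplified illustration of n-gram decontamination: flag a training question
# if any of its 10-grams also appears in a benchmark question.
def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, benchmark_questions: list[str], n: int = 10) -> bool:
    benchmark_grams: set[tuple[str, ...]] = set()
    for bq in benchmark_questions:
        benchmark_grams |= ngrams(bq, n)
    return bool(ngrams(question, n) & benchmark_grams)
```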

[2] Shao et al., DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, 2024
[3] Yang et al., Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement, 2024
[4] Yang et al., Rethinking Benchmark and Contamination for Language Models with Rephrased Samples, arXiv:2311.04850, 2023

Q4: The paper does not effectively demonstrate that timestamps mitigate contamination. There is no quantitative analysis of new problem duplication rates, such as statistics on overlap between different time periods and previous datasets.

A4: Thank you for the suggestion! We demonstrate the relationship between timestamps and contamination by decontaminating against the massive Numina training set, following the decontamination setup of Qwen2.5 [3]. Below, we report how the contamination rate changes with respect to the timestamp, measured as overlap against the Numina CoT training dataset (released July 2024):

Time                 | Jan-Apr 2023 | May-Aug 2023 | Sep-Dec 2023 | Jan-Apr 2024 | May-Aug 2024
10-gram Overlap Rate | 13.24%       | 11.65%       | 12.82%       | 9.92%        | 6.88%
Overlapped Questions | 229          | 208          | 218          | 226          | 109
Total Questions      | 1730         | 1785         | 1701         | 2278         | 1585

As we can see, the rate of potential contamination significantly decreases as timestamps increase. We will add this info to our revised manuscript.
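The per-period rates above could be computed along these lines; this is a sketch under assumed field names, reusing a 10-gram contamination check like the one sketched earlier.

```python
# Hypothetical sketch: bucket evaluation questions by time period and report the
# fraction flagged as overlapping with an older training set.
from collections import defaultdict
from typing import Callable

def overlap_by_period(questions: list[dict], flag: Callable[[str], bool]) -> dict[str, float]:
    totals: dict[str, int] = defaultdict(int)
    overlaps: dict[str, int] = defaultdict(int)
    for q in questions:
        period = q["period"]                 # e.g., "Jan-Apr 2023", assumed precomputed
        totals[period] += 1
        if flag(q["question"]):              # e.g., 10-gram overlap with the older training set
            overlaps[period] += 1
    return {p: overlaps[p] / totals[p] for p in totals}
```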

Q5: NuminaMath also sources data from AoPS and rewrites solutions using GPT-4o. How does your approach compare to theirs, and what advantages does it offer?

A5: While Numina sources some problems from AoPS contest pages, their problem-solution extraction mechanism is very simple, leading to a much smaller number of extracted solutions (30K QA pairs) compared to our method. Their method only extracts problems from contest pages (where the problems are formally written), and their solution extraction takes the longest post that contains a black box at the end of the proof.

Instead, we explore the entire AoPS forum discussions, where questions and solutions are shared in unstructured posts by community members. This requires extensive use of LLMs for data filtering and solution extraction, allowing us to extract 652K QA pairs (in contrast to Numina's 30K), resulting in a 21-fold increase in extracted data from AoPS. These improvements result in a stronger dataset, leading to higher performance gains as shown in Table 2 of our paper.

Review (Rating: 3)

This paper introduces AoPS-Instruct, a dataset of 666K Olympiad-level math QA pairs, and LiveAoPSBench, a contamination-resistant benchmark, both sourced from the Art of Problem Solving (AoPS) forum. Using an automated pipeline, the authors extract and refine QA pairs, leveraging Qwen 2.5 72B to rewrite solutions into step-by-step explanations. Fine-tuning LLMs on AoPS-Instruct improves performance on benchmarks like OlympiadBench and Omni-MATH, while LiveAoPSBench reveals performance drops over time, indicating prior benchmark contamination and thus, highlights the importance of continually evolving datasets for fair LLM evaluation.

Questions for Authors

Please see the Methods and Evaluation Criteria and the Experimental Design and Analyses sections.

Claims and Evidence

The paper provides sufficient empirical evidence for all the claims made.

Methods and Evaluation Criteria

The description of the methodology lacks details about the motivation behind certain design choices, as well as other details:

  • Why is Llama-3.1-70B used for question-answer extraction, as compared to a chat variant of the Qwen-2.5 series?
  • In Step 3 (solution rewriting), how is the solution to be rewritten selected from all the solutions provided by different users, as extracted from the forum?
  • Are any measures taken to remove the 5% incorrect and 3% no-answer category questions from the final evaluation set? The presence of this many degenerate questions could lead to significantly wrong judgments about the performance of models.

Theoretical Claims

The paper does not make any theoretical claims

Experimental Design and Analyses

I have certain questions and suggestions about the experiments reported in the paper.

  • In Table 2, the size of the datasets on which the models have been fine-tuned has not been specified. I believe it would be important to ensure that the amount of data on which the models are fine-tuned is the same across different datasets in order to ensure a fair comparison.
  • For the comparison in Table 2, it would be useful to have a comparison with a stronger baseline. Several powerful synthetic datasets (e.g., ScaleQuest [1] and DART-Math [2]) are plausible candidates.
  • The number of seeds over which the performances are computed has not been reported. It is important to discuss the statistical significance of small differences (such as the difference in performance of models on AoPS-Ins and Numina+AoPS-Ins in Table 2).
  • It would be useful to report the performances of some state-of-the-art models such as o1, DeepSeek-R1, and the DeepSeek-R1 distilled series on LiveAoPSBench. This would give a better idea of the difficulty level of the benchmark.

Supplementary Material

I have gone through some of the evaluation questions in the supplementary material attached to the submission. I have not gone through the code in detail.

Relation to Broader Scientific Literature

This paper falls within the vast literature on developing novel benchmarks for evaluating the mathematical reasoning abilities of LLMs. More specifically, this work focuses on developing contamination-resistant evaluation benchmarks by continually collecting new Olympiad-level questions discussed on the AoPS forum. Another work that attempts to tackle the problem of evaluation data contamination is [3], which uses an AI-assisted approach for creating new and difficult mathematical questions.

Essential References Not Discussed

I believe the following papers are worth discussing:

Synthetically Generated Training Datasets for Math Reasoning

[1] Ding et al., 2024; Unleashing Reasoning Capability of LLMs via Scalable Question Generation from Scratch; https://arxiv.org/abs/2410.18693
[2] Tong et al., 2024; DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Reasoning; https://arxiv.org/abs/2407.13690

Connections with approaches for creating math evaluation benchmarks to address data contamination in existing benchmarks

[3] Shah et al., 2024; AI-Assisted Generation of Difficult Math Questions; https://arxiv.org/abs/2407.21009

Other Strengths and Weaknesses

Strengths

The paper is well motivated and presents systematic pipelines for creating Olympiad-level training datasets and contamination-resistant evaluation datasets. The paper is generally well written and provides a good set of experiments and ablation studies to show the usefulness of the approach.

Weaknesses

Please refer to the Methods and Evaluation Criteria and Experimental Design and Analyses sections.

Other Comments or Suggestions

I have mentioned all comments and suggestions that I have in previous sections. The paper attempts to tackle an important problem and proposes a pipeline which is more generally applicable. However, I believe it lacks certain important clarifications and details which I have highlighted previously.

Author Response

We thank the reviewer for their constructive feedback.

Q1: In Table 2, the size of the datasets on which the models have been fine-tuned has not been specified. It would be important to ensure that the amount of data on which the models are fine-tuned is the same across different datasets to ensure a fair comparison.

A1: We used the full Numina dataset, decontaminated against our evaluation benchmarks, resulting in 824K data points. In comparison, our dataset contains 647K data points. Despite being smaller, our dataset demonstrates a greater performance improvement than Numina, as highlighted in Table 2. We will add this information to our revised manuscript.

Q2: For the comparison in Table 2, it would be useful to have a comparison with a stronger baseline. Several powerful synthetic datasets (e.g., ScaleQuest [1] and DART-Math [2]) are plausible candidates.

A2: ScaleQuest: Thank you for bringing our attention to this very recent work. We will add a citation to this work as a concurrent work.

DART-Math: Thank you for pointing this work out. Below, we take the DeepSeek model from the DART-MATH paper, and compare it with our DeepSeek-fine-tuned model:

Model     | AoPS24 | MATH | OlympiadBench | OmniMath
DART-Math | 13.3   | 53.6 | 22.5          | 15.2
AoPS-Ins  | 19.0   | 58.8 | 24.3          | 17.8

We observe that our model outperforms the DART-MATH model on all benchmarks. We will add this table to our paper.

Q3: The number of seeds over which the performances are computed has not been reported.

A3: We measure the Pass@1 metric in all of our tables, and for non-R1 models, the common practice is to use a temperature of zero for performance measurement. Therefore, there is no stochasticity in the metric, and repeating over multiple seeds is unnecessary. This is in line with other works in mathematical reasoning (e.g., see [2] and [3]), where there are no standard deviations to report for Pass@1.
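To illustrate the deterministic Pass@1 protocol described here, a minimal sketch follows; `generate` and `extract_final_answer` are placeholder callables for illustration, not a specific library API.

```python
# Hypothetical sketch of greedy (temperature-0) Pass@1 evaluation: one deterministic
# generation per problem, scored against the reference answer.
from typing import Callable

def pass_at_1(
    problems: list[dict],
    generate: Callable[..., str],
    extract_final_answer: Callable[[str], str],
) -> float:
    correct = 0
    for p in problems:
        output = generate(p["question"], temperature=0.0)   # greedy decoding, no sampling
        if extract_final_answer(output) == p["answer"]:
            correct += 1
    return correct / len(problems)
```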

Q4: It would be useful to report the performances of some state of the art models such as o1, DeepSeek-R1, DeepSeek-R1 distilled series, etc. on LiveAoPSBench. This would give a better idea of the difficulty level of the benchmark.

A4: We have reported the performance of R1-distilled models in Figure 8 in the Appendix. However, assessing the full R1 and o1 models is both computationally and financially prohibitive due to their high operational costs, which prevents us from including their results.

Q5: Are any measures taken to remove the 5% incorrect and 3% no-answer category questions from the final evaluation set? The presence of these amounts of degenerate questions could cause significantly wrong judgments about the performance of models.

A5: LiveAoPSBench is a dynamically evolving benchmark, and we continuously work to refine the pipeline as we identify failure cases. We should note that current large language models perform well below the ceiling implied by this error rate (e.g., DeepSeek distilled models achieve only 52.2% accuracy on our benchmark). Moreover, even the human-labeled GSM8K dataset is known to contain a small percentage of label errors [1], yet it remained a highly discriminative benchmark until state-of-the-art models began to saturate it. Thus, while we are committed to enhancing LiveAoPSBench, we believe it remains valuable and informative in its current state.

Q6: In Step 3 (solution rewriting), how is the solution to be rewritten selected from all the solutions provided by different users, as extracted from the forum?

A6: We input the whole topic discussion to the LLM, and the LLM must infer the indexes of the posts that contain a correct solution. This is done by analyzing the forum discussion to detect correct answers. We use a few-shot chain-of-thought prompt to extract these (see Figure 12 for the exact prompt, and our code for the few-shot examples).
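A rough sketch of how such a prompt could be assembled is given below; the wording and structure are illustrative assumptions, while the authors' exact prompt is in their Figure 12 and released code.

```python
# Hypothetical sketch: build a few-shot chain-of-thought prompt asking the LLM to
# identify which post indexes in a forum topic contain a correct solution.
def build_extraction_prompt(few_shot_examples: list[dict], topic_posts: list[str]) -> str:
    parts = [
        "You are given a forum discussion about a math problem. Reason step by step, "
        "then list the indexes of the posts that contain a correct solution."
    ]
    for ex in few_shot_examples:
        numbered = "\n".join(f"[{i}] {post}" for i, post in enumerate(ex["posts"]))
        parts.append(f"Discussion:\n{numbered}\nReasoning and answer:\n{ex['answer']}")
    numbered = "\n".join(f"[{i}] {post}" for i, post in enumerate(topic_posts))
    parts.append(f"Discussion:\n{numbered}\nReasoning and answer:")
    return "\n\n".join(parts)
```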

Q7: Why is Llama-3.1-70B used for question-answer extraction, as compared to a chat variant of the Qwen-2.5 series?

A7: We apologize for the typo. In fact, we utilize Qwen-2.5 across the entire AoPS-instruct pipeline (as illustrated in Figure 1 and Table 1). We will correct this mistake in our revised submission.

Q8: Another work which attempts to tackle the problem of evaluation data contamination includes Shah et al., which uses an AI-Assisted approach for creating new and difficult mathematical questions.

A8: Thank you for highlighting this relevant work. The work of Shah et al. aims to produce an evaluation dataset that tests compositional generalization by combining pairs of distinct mathematical skills extracted from the MATH dataset. This differs from our methodology, which emphasizes dynamically evolving, community-driven problem-solving data. We appreciate the suggestion and will include a citation to this work in our revised manuscript.

[1] https://gradientscience.org/platinum-benchmarks

[2] Yang et al., Qwen2.5-Math

[3] Shao et al., DeepSeekMath

Reviewer Comment

I thank the authors for the detailed rebuttal and for clarifying my questions.

DART-Math comparison
If I understand correctly, the DART-Math-DSMath-7B model referred to in the rebuttal was obtained by fine-tuning DeepSeekMath-7B (the base model) using DART-Math data, whereas the model reported in the paper was obtained by fine-tuning DeepSeekMath-7B-Ins (i.e., the instruction-tuned version) on the AoPS-Instruct dataset. This is not a fair comparison, since instruction-tuned models usually lead to better fine-tuned models.

The empirical evaluation of the effectiveness of AoPS-Instruct seems limited in general. I would encourage the authors to include more base models and comparisons with datasets such as ScaleQuest and DART-Math (a subset of size equal to AoPS-Instruct, if affordable within the authors' compute budget) in the revised version.

Correctness of Data
I also maintain my concern about the error rates in the eval dataset. While it is true that existing models do not saturate the dataset and that previous benchmarks have had errors as well, I believe that evaluation benchmarks should be held to the highest standards.

Despite the above concerns, considering that the authors addressed my other concerns satisfactorily, and the general utility of the pipeline, I am increasing my score.

Author Comment

We thank the reviewer for their continued engagement and for acknowledging the utility of our proposed pipeline.

DART-Math Comparison:
We appreciate the reviewer's observation regarding the fairness of the comparison between DART-Math and AoPS-Instruct. To address this concern, we took the DART-Math-Hard dataset and fine-tuned DeepSeek-Math-Instruct on it for three epochs using our setup, ensuring a more direct and fair comparison. The updated results are as follows:

Dataset        | Size | AoPS | OlympiadBench | Omni-MATH | MATH
DART-MATH-Hard | 585k | 14.4 | 21.8          | 15.4      | 52.5
AoPS-Ins       | 647k | 19.0 | 24.3          | 17.8      | 58.8

As shown, AoPS-Instruct fine-tuning still outperforms DART-MATH-Hard across all four benchmarks. We will add this experiment to our revised manuscript for further validation.

Final Decision

The paper presents a benchmark based on the Art of Problem Solving forum. Compared to earlier efforts like NuminaMath, the benchmark is an order of magnitude larger. The most notable factor is the capability of creating LiveAoPSBench, which can continually evaluate language models. Reviewers appreciated the benchmark contribution and the evaluation of different models, especially on LiveAoPSBench. Some reviewers requested that bigger models be evaluated on the benchmark, but a reasonable number of models are evaluated on the live version. Overall, I am very supportive of the paper's contribution of a live, contamination-avoiding benchmark.