STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving
We design a self-play algorithm for theorem proving that iteratively conjectures new statements and proves them, achieving state-of-the-art performance on miniF2F and ProofNet among whole-proof generation methods.
Abstract
Reviews and Discussion
This paper introduces STP, a self-play training framework for automated theorem proving. STP employs a conjecturer to generate new conjectures based on existing theorems and lemmas, while a prover attempts to prove previously unproven conjectures or statements in an iterative process. Experiments on the LeanWorkbook and miniF2F benchmarks show that STP is more sample-efficient than expert iteration and reinforcement learning, achieving state-of-the-art performance on both datasets across the Lean and Isabelle proof assistants.
Questions for Authors
Could the paper include more ablation studies? Specifically, what is the accuracy of STP after the SFT stage? Additionally, what is the accuracy of STP after self-play training without retraining?
Furthermore, could you provide more examples of how STP generates variants of existing theorems? Does the conjecturer primarily produce slight variations of input theorems, or can it genuinely generalize to harder problems? It would be valuable to conduct a small analysis comparing the difficulty of the generated conjectures to the original theorems to better understand STP's ability to propose more challenging problems.
Claims and Evidence
Most claims in the submission are well-supported by evidence.
Methods and Evaluation Criteria
I think the proposed methods and evaluation criteria are well-motivated and logically sound.
Theoretical Claims
N/A
Experimental Design and Analysis
I think most of the experiment designs and analyses make sense.
Supplementary Material
Yes.
Relation to Existing Literature
I think the key contribution of this paper is the conjecturer, which can generate new, potentially provable training data, helping to mitigate data scarcity and the sparse reward problem in formal theorem proving. This idea is very similar to some approaches in general reasoning tasks (e.g., [1,2]), which also leverage LLM-based rewriting or permutation methods to generate variant training data.
[1] MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models
[2] WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct
Missing Important References
Besides the above references, I think the following work is not currently discussed in the paper: [1], which proposes generating theorems and proofs from seed concepts. Moreover, I believe the discussion of [2] is not entirely fair, as it focuses on a more challenging setting where unpretrained LLMs generate harder conjectures than those in the training set, thereby avoiding data contamination issues. In contrast, STP in this paper fully leverages an LLM to generate variants of existing theorems rather than entirely harder conjectures.
[1] MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data
[2] Learning Formal Mathematics From Intrinsic Motivation
Other Strengths and Weaknesses
I think the paper is well-motivated and well-written, with comprehensive experiments. I also appreciate the demonstrated examples, which show that the conjecturer can indeed generate more generalized—and potentially harder—problems given input theorems and lemmas. Additionally, I find the design of reward assignments and filtering strategies well thought out, all of which make sense to me.
One major concern I have is that the training dataset design for SFT-conjecture data does not seem to explicitly encourage the conjecturer to generate harder theorems. Instead, it appears to primarily promote the generation of variants of existing theorems, given that theorem X and theorem Y originate from the same proof file. There is no guarantee that the conjecturer will propose genuinely harder conjectures, as its design relies entirely on the diverse sampling capabilities of LLMs.
Other Comments or Suggestions
Minor: In the paper, the authors state that ''RL’s capability is fundamentally bounded by the difficulty level of the theorems in the training dataset—it is unlikely, in principle, for a model to learn college-level proof techniques solely by working on high school-level problems or to solve open math problems using RL on graduate-level problems''. However, I don't think the proposed STP can truly address such a distribution shift problem, as the conjecturer primarily generates variants of existing theorems and remains heavily reliant on the training dataset.
We thank reviewer 76YT for their positive review, and for noting “the paper is well-motivated and well-written, with comprehensive experiments”.
I think the following work is not currently discussed in the paper: [1],
Thank you for the comments. We will include a discussion about [1] upon revision.
[1] MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data
the training dataset design for SFT-conjecture data does not seem to explicitly encourage the conjecturer to generate harder theorems. [...]
The conjecture dataset at the SFT stage only serves as an initialization, and we encourage the conjecturer to generate harder conjectures in STP’s iterative training — at every iteration, the conjecturer is continuously trained on previously generated conjectures on which the prover has a low pass rate. Therefore, as the prover gets better, the training data of the conjecturer will encourage harder conjectures at the next iteration.
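For concreteness, here is a minimal sketch of that selection step (illustrative Python, not our exact implementation; the (0, 1/4] band is the "barely provable" range used in the paper):

```python
def select_conjecturer_training_data(conjectures, proof_attempts, band=(0.0, 0.25)):
    """conjectures: list of generated statements.
    proof_attempts: dict mapping a conjecture to a list of booleans, one per
    sampled proof (True = verified). Keeps the "barely provable" conjectures:
    proved at least once, but with an empirical pass rate of at most 1/4."""
    lo, hi = band
    selected = []
    for conj in conjectures:
        verdicts = proof_attempts[conj]
        pass_rate = sum(verdicts) / len(verdicts)
        if lo < pass_rate <= hi:
            selected.append(conj)
    return selected
```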
Minor: In the paper, the authors state that ''RL’s capability is fundamentally bounded by the difficulty level of the theorems in the training dataset [...]''. However, I don't think the proposed STP can truly address such a distribution shift problem, as the conjecturer primarily generates variants of existing theorems and remains heavily reliant on the training dataset.
Thanks for the comments! We agree with the reviewer that STP in its current form is unlikely to discover fundamentally novel proof techniques, and we do not claim that STP truly addresses this particular issue. This paragraph serves as a high-level motivation, and STP is only a first step in this direction (more promising than the standard RL baseline). Moreover, even though STP primarily generates variants of existing theorems, it does discover some interesting new ones, or at least harder forms of existing theorems. We also trained with only limited compute and data, and realizing the full potential of this line of ideas requires more future work.
what is the accuracy of STP after the SFT stage? Additionally, what is the accuracy of STP after self-play training without retraining?
We thank the reviewer for their comments. The SFT checkpoint is not very different from the base model (DeepSeek-Prover-V1.5-SFT):
| Method | Sample budget | miniF2F-test | ProofNet-test |
|---|---|---|---|
| SFT checkpoint | 1 | 35.7% | 3.8% |
| SFT checkpoint | 128 | 51.4% | 15.4% |
| SFT checkpoint | 3200 | 55.7% | 19.4% |
| STP | 1 | 41.1% | 5.9% |
| STP | 128 | 57.2% | 18.0% |
| STP | 3200 | 61.1% | 23.1% |
STP is also better than STP w/o retraining. In general, STP w/o retraining has a higher pass@1 but a lower pass@k for large k.
| Method | Sample budget | miniF2F-test | ProofNet-test |
|---|---|---|---|
| STP w/o retraining | 1 | 45.1% | 7.8% |
| STP w/o retraining | 128 | 57.2% | 16.9% |
| STP w/o retraining | 3200 | 60.7% | 21.5% |
| STP | 1 | 41.1% | 5.9% |
| STP | 128 | 57.2% | 18.0% |
| STP | 3200 | 61.1% | 23.1% |
We will include these results upon revision.
Furthermore, could you provide more examples of how STP generates variants of existing theorems? Does the conjecturer primarily produce slight variations of input theorems, or can it genuinely generalize to harder problems? It would be valuable to conduct a small analysis comparing the difficulty of the generated conjectures to the original theorems
The conjecturer generates both variants and more challenging problems. On average, the proof length of the conjectures is 1.98 times that of the seed statements, implying they are generally more difficult to prove. As an example, the following conjecture involves the square root function, where the seed theorem is about polynomials:
Seed theorem:
```lean
theorem lean_workbook_plus_36664 (a b : ℝ) (ha : 0 < a) (hb : 0 < b) (hab : (1 + a^2) * (1 + b^2) = 4) : a + b + a * b ≤ 3
```
Conjecture:
```lean
theorem lean_workbook_plus_36684 (a b c : ℝ) (ha : 0 < a) (hb : 0 < b) (hc : 0 < c) (hab : a * b * c = 1) : Real.sqrt ((a ^ 2 + 1) / b) + Real.sqrt ((b ^ 2 + 1) / c) + Real.sqrt ((c ^ 2 + 1) / a) ≥ 3 / (a + b + c)
```
The generated conjectures can also have very different proofs:
Seed theorem:
```lean
theorem lean_workbook_plus_12215 : sin (π / 2) = 1
```
Conjecture:
```lean
theorem lean_workbook_plus_60 : ¬(∀ f : ℝ → ℝ , (∀ x :ℝ, f (x + π / 2) = f x - 1) ↔ f = fun x ↦ cos x)
```
Thanks for the clarification—I still hold a positive view of the paper. From the examples, it appears that the conjecturer can produce quite diverse conjectures from the same seed theorems. I do have a follow-up question: do these generated conjectures tend to preserve similar properties as those in the training data? For instance, do they rely on some of the same lemmas in their proofs?
We thank reviewer 76YT for acknowledging our response and providing additional comments.
Do these generated conjectures tend to preserve similar properties as those in the training data? For instance, do they rely on some of the same lemmas in their proofs?
Yes, we can use the distribution of lemmas in the proofs as a proxy for similarity between the generated conjectures and the training data theorems. Below are the 20 most frequently used lemmas in the proofs of generated conjectures, most of which are about algebra and inequalities:
sq_nonneg, mul_self_nonneg, div_le_div_iff, Real.sq_sqrt, div_le_iff, Real.sin_sq_add_cos_sq, pow_two, Real.sin_add, Real.cos_add, Real.sqrt_pos, div_nonneg, Real.cos_sub, Real.sin_sub, pow_pos, Real.cos_two_mul, Real.cos_sq, div_le_one, Real.sin_two_mul, pow_add, pow_mul
Among these 20 lemmas, 17 also appear among the top 20 most frequent lemmas in the proofs of LeanWorkbook theorems, suggesting that the generated conjectures are generally aligned with the problems in LeanWorkbook.
We also conducted a granular analysis by examining the overlap of lemmas between each generated conjecture and its corresponding seed theorem. At a late checkpoint of our experiments, we found that 51.1% of correct proofs for generated conjectures share at least one lemma with the proof of the seed statement. Specifically, 33.44% share exactly one lemma, 12.45% share two, 4.22% share three, and about 1% share four or more. This indicates that a substantial portion of generated conjectures are meaningfully related to their seed theorems.
The shared lemmas also span diverse topics. Below are the 30 most frequently shared lemmas. While inequalities and algebra remain dominant (e.g., sq_nonneg, Real.sin_sq_add_cos_sq), we also see lemmas related to sequence of products (i.e., Finset.prod_range_succ'), and number theory (e.g., Nat.pow_mod, nat_sub_dvd_pow_sub_pow).
(Note: this list differs from the one above because it only counts lemmas shared between a conjecture and its seed theorem, whereas the previous list counts all lemmas used in generated conjectures.)
sq_nonneg, mul_self_nonneg, div_le_div_iff, Real.sq_sqrt, Real.sin_add, Real.sin_sq_add_cos_sq, Real.cos_add, pow_two, div_le_iff, Real.cos_sub, Real.sin_sub, Real.cos_sq, Real.sin_two_mul, Real.cos_two_mul, pow_mul, div_le_one, Real.sqrt_pos, pow_add, Real.tan_eq_sin_div_cos, abs_le, Finset.prod_range_succ', pow_pos, Real.cos_sq_add_sin_sq, Nat.sum_range_choose, Nat.pow_mod, pow_three, abs_mul, abs_cases, Real.pi_pos, nat_sub_dvd_pow_sub_pow
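For reference, the overlap statistics above were computed roughly as in the following sketch (illustrative Python; extract_lemmas is a placeholder for extracting lemma names from a verified Lean proof):

```python
from collections import Counter

def lemma_overlap_stats(conjecture_records, seed_proofs, extract_lemmas):
    """conjecture_records: list of (conjecture_proof, seed_theorem_id) pairs for
    conjectures with a verified proof.
    seed_proofs: dict mapping a seed theorem id to its proof.
    extract_lemmas: placeholder returning the set of lemma names used in a proof."""
    counts = Counter()
    for proof, seed_id in conjecture_records:
        shared = extract_lemmas(proof) & extract_lemmas(seed_proofs[seed_id])
        counts[len(shared)] += 1
    total = sum(counts.values())
    # e.g. fraction of verified conjecture proofs sharing >= 1 lemma with the seed proof
    frac_at_least_one = sum(c for n, c in counts.items() if n >= 1) / total
    return counts, frac_at_least_one
```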
We will include these discussions in the next revision.
The paper proposes a novel method called Self-play Theorem Prover (STP). STP addresses the shortage of high-quality training data in automated formal theorem proving by simultaneously training two roles: a conjecturer that produces new theorems (or "conjectures") and a prover that attempts to prove them. This approach aims to overcome the plateau seen in reinforcement learning (or "expert iteration") methods.
STP continually generates fresh conjectures that match the model's current skill level — "barely provable" ones, with an empirical pass rate in (0, 1/4]. The prover's successful proofs of these conjectures supply training data to improve the model's proof-writing capabilities. Through experiments using Lean and Isabelle, STP is shown to outperform standard baselines.
Questions for Authors
I'd love to see what the limit of the method is. Does it plateau, and if so, when and why?
Claims and Evidence
The main claim of the paper is that it enables data generation via a self-play mechanism. The key novel element is a mechanism for generating well-calibrated conjectures. The benefits of this are shown by comparisons with two baselines: expert iteration and parallel sampling. The paper claims significant improvement with respect to these baselines.
The paper also claims to outperform DeepSeek-Prover-V1.5-RL on the miniF2F and ProofNet benchmarks. The evidence is the pass rates gathered in Table 1.
Methods and Evaluation Criteria
The method can be briefly summarised as follows:
A single large language model (LLM) alternates between two tasks:
1. Conjecturer – generates new mathematical statements (conjectures).
2. Prover – attempts to prove the generated conjectures along with existing theorems.
Key steps of STP:
1. Generating conjectures: given a known theorem and proof, the conjecturer synthesizes new conjectures that are barely provable.
2. Proving conjectures and theorems: the prover attempts to prove both newly generated conjectures and existing unproven theorems from a dataset (successful ones are added to the dataset).
3. Training with dense signals: the prover is fine-tuned on successful proofs, and the conjecturer is updated based on which conjectures led to useful proofs.
4. Go to step 1.
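To check my understanding, the loop can be sketched roughly as follows (my own illustrative pseudocode, not the authors' implementation; names are placeholders):

```python
def stp_iteration(model, dataset, proven, verifier, k_proofs=16):
    """One STP iteration as I understand it (illustrative only)."""
    # 1. Conjecturing: propose new statements from known (theorem, proof) pairs.
    conjectures = [model.conjecture(stmt, proof) for stmt, proof in proven.items()]

    # 2. Proving: attempt both new conjectures and still-unproven dataset statements.
    targets = conjectures + [s for s in dataset if s not in proven]
    verified = {}
    for stmt in targets:
        attempts = [model.prove(stmt) for _ in range(k_proofs)]
        verified[stmt] = [p for p in attempts if verifier.check(stmt, p)]

    # 3. Training with dense signals: the prover learns from verified proofs;
    #    the conjecturer learns from conjectures that were barely provable.
    prover_data = [(s, proofs[0]) for s, proofs in verified.items() if proofs]
    conjecturer_data = [s for s in conjectures
                        if 0 < len(verified[s]) / k_proofs <= 0.25]
    model.finetune(prover_data, conjecturer_data)

    # 4. Newly proven statements are kept for the next iteration (go to step 1).
    proven.update({s: proofs[0] for s, proofs in verified.items() if proofs})
    return model, proven
```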
The method is very intuitive and reasonable. Its main design is unsurprising; the value lies in the proposed implementation details. These details feel correct and valid; at the same time, they are somewhat arbitrary and not well justified or motivated. I'd love to see more ablations! (I totally understand that the cost is quite high, though.)
I find the evaluation criteria perhaps the weakest point of the paper. My criticism is as follows:
- Fig 2 - I am not sure how strong the baselines are (and how much effort has gone into tuning them)
- Fig 3 - I find comparisons with DeepSeek-Prover not quite informative without careful information on the training budget of the two methods (and any detail to ensure an apples-to-apples comparison)
- Table 1 - the same as above. Moreover, I am not sure what the difference is between the sample budget (#proofs) and the sample budget (#steps).
- another subtle issue about the comparison of STP and DeepSeek is how much the presented method 'overfits' to benchmarks. Following that, I'd love to see a deep analysis of the generalization properties (although doing this properly might be another paper)
Having said that, I stress that I am not an expert in the field. It's hard for me to fully understand the results and the benchmarks used.
Theoretical Claims
The paper does not contain any significant theoretical claims.
Experimental Design and Analysis
Correct
Supplementary Material
I've briefly reviewed the supplementary material, which feels fine but not very exhaustive.
Relation to Existing Literature
I am not an expert in the field. The related work covers most of the essential topics.
Missing Important References
In the RL parlance, the presented work can be framed as an exploration problem and curriculum construction. I find it strange not to include a short discussion about these topics. I do not insist on going deep.
Other Strengths and Weaknesses
see above
Other Comments or Suggestions
see above
We thank reviewer vtLT for their review. In the following, we address the reviewer’s questions/comments in detail.
Fig 2 - I am not sure how strong the baselines are (and how much effort has been given into tuning them)
We spent equal effort tuning the baselines and STP (if not more effort on tuning the baselines). We mostly use off-the-shelf configurations for finetuning LLMs, and only optimize the learning rate and sampling parameters. Our reproduction of DeepSeek-Prover-V1.5-RL on LeanWorkbook is actually better than the best previously reported accuracy. We believe that the amount of tuning is more than sufficient given the enormous gap between our method and the baselines — we do not expect the baselines' hyperparameters to make a difference anywhere near the size of that gap.
Fig 3 - I find comparisons with Deepseek-Prover not quite informative without careful information on the training budget of the two methods (and any detail to ensure an apple-2-apple comparison)
First, we’d like to point out that our model is trained on top of Deepseek-Prover-V1.5-SFT. Hence, we assume that the reviewer’s question is about the comparison between STP and Deepseek-Prover as if it were trained with the same additional compute (please kindly let us know otherwise).
Since DeepSeek-Prover-V1.5-SFT is trained with expert iteration, our expert iteration baseline simulates the performance of DeepSeek's model as if it were trained for more iterations. Note that the training compute is approximately proportional to the number of proofs generated during training (for our method, we count the proofs of both the statements in the given dataset and the generated conjectures; for expert iteration, we count the proofs of the statements in the given dataset). Therefore, Figure 2 is an apples-to-apples comparison between our method and DeepSeek-Prover.
Table 1 - the same as above. Moreover, I am not sure what the difference is between the Sample budget (#proofs) and sample budgets (#steps).
Sample budget (#proofs) and sample budget (#steps) are two different ways to measure inference-time compute. For whole-proof generation methods (which generate an entire proof auto-regressively with an LLM, such as STP), sample budget (#proofs) is the most common metric: it is the total number of proofs generated independently per problem. Sample budget (#steps) is mostly used for tree-search methods that use LLMs to generate single proof steps (e.g., a single line of the proof) and then search for a complete proof via best-first search or MCTS. We report both for completeness.
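Concretely, under sample budget (#proofs), a statement counts as solved if at least one of its k independently generated proofs verifies. A minimal sketch (illustrative only; names are placeholders):

```python
def pass_at_budget(results, k):
    """results: dict mapping each benchmark statement to a list of booleans,
    one per independently sampled whole proof (True = verified by Lean/Isabelle).
    Returns the fraction of statements solved within a sample budget of k proofs."""
    solved = sum(1 for verdicts in results.values() if any(verdicts[:k]))
    return solved / len(results)
```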
another subtle issue about the comparison of STP and deepseek is how much the presented method 'overfits' to benchmarks. Following that, I'd love to see a deep analysis of the generalization properties (although doing this properly might be another paper)
We assume that the reviewer’s question is about whether any method overfits to the benchmarks like miniF2F, ProofNet, and PutnamBench. The STP checkpoint (Line 291) is only trained on LeanWorkbook, and we do not use any early-stopping methods that may leak information about the test datasets. In addition, miniF2F, ProofNet, and PutnamBench cover statements from different sources and of different difficulty. Therefore, we believe that it is unlikely for STP to overfit to all three test benchmarks at the same time.
(If the reviewer’s question is about whether STP or DeepSeek-Prover overfits to formal proofs, the answer is yes, but that’s expected because they are only trained on formal proofs. None of the models can generalize to natural language tasks, and we believe that generalizing formal proofs to natural language tasks is still an open question.)
In the RL parlance, the presented work can be framed as an exploration problem and curriculum construction. I find it strange not to include a short discussion about these topics. I do not insist on going deep.
We thank the reviewer for the comments. Our method can indeed be framed as automatic curriculum learning (e.g., [1]), and we included some of the related work in Lines 177-183. Upon revision, we will include a broader discussion of these connections.
[1] Portelas, Rémy, et al. Automatic Curriculum Learning for Deep RL: A Short Survey.
I'd love to see what is the limit of the method. Does it plateau, when, why?
After the submission, we ran STP for longer, and it continued to improve. With about 50B generated tokens (requiring 86K TPU-v4 hours in total; 2.5x the compute used in the submitted experiments), STP reaches a pass rate of approximately 28.5% on LeanWorkbook, significantly higher than the best result reported in the paper. We do not have the compute to run it even further. We also estimate that only about 50% of the LeanWorkbook statements are correct, so 28.5% is already quite high.
The paper studies the problem of improving LLM theorem provers through training on self-generated conjectures. The approach, called Self-play Theorem Prover (STP), is a framework to train LLM theorem provers in a dual conjecturer-prover setup, evaluated with Lean and Isabelle as formal verifiers. STP is initialized with supervised fine-tuning on proof libraries like Mathlib, then iterates through a self-play loop: the conjecturer, built on DeepSeek-Prover-V1.5-SFT, generates conjectures from seed theorems and lemmas, guided by a reward function targeting low but positive empirical pass rates and an elegancy score. The prover, trained via expert iteration, attempts proofs with a fixed sampling budget per statement/conjecture, and the proofs are verified with Lean/Isabelle. With 19.8 billion tokens generated over 24 Lean iterations (120M proofs, 2M conjectures), STP achieves a 26.3% cumulative pass rate on LeanWorkbook and strong performance on miniF2F-test and ProofNet-test. The approach is also evaluated with Isabelle, using Llemma-7b as the base model, on a translated version of LeanWorkbook and on PutnamBench, where it again shows strong performance gains.
Update after rebuttal:
The authors clarified several of my concerns during the rebuttal, and considering the other reviews, I maintain my positive assessment of the work.
Questions for Authors
Please see the sections above.
Claims and Evidence
C1: The paper proposes a self-play mechanism for training LLM theorem provers that enables continuous improvement without requiring additional data, and achieves state-of-the-art performance among whole-proof generation methods.
This is the central claim and contribution of the paper. The experiments in Section 4 provide much of the evidence for the claim. In particular, Figures 2, 4, and 5 provide evidence for continuous improvement over iterations of conjecture generation and proving. Table 1 provides the results on miniF2F and ProofNet, Table 3 contains the results on PutnamBench, and STP achieves state-of-the-art performance, outperforming DeepSeek-Prover-V1.5 (I'd like to note that the DeepSeek-Prover baseline is missing in Table 3). The results also demonstrate strong performance of the method across two different base models (Llemma and DeepSeek-Prover-V1.5-SFT) as well as two different proof verification systems (Lean and Isabelle). Overall, there is strong evidence for the claim.
C2: The self-play mechanism provides denser training signals, making training easier, and retraining with generated conjectures improves performance.
The authors conduct two ablations, with results presented in Fig 6 and Table 2, which provide sufficient evidence for these claims.
Methods and Evaluation Criteria
The authors use standard benchmark datasets miniF2F, ProofNet, and PutnamBench. The methods used for comparison are also established state-of-the-art to the best of my knowledge.
Theoretical Claims
N/A
Experimental Design and Analysis
The experiment design overall is quite sound, but there are a couple of things that seem lacking:
- The DeepSeek-Prover baseline appears to be missing in Table 3.
- There are some examples of generated conjectures in the paper, but I think a bit more analysis of the conjectures would be useful. For instance, some details about what the distribution of topics looks like among the generated conjectures or how similar they are to the problems in LeanWorkbook.
Supplementary Material
I checked some of the implementation details in Appendix A as well as the additional results in Appendix B.
Relation to Existing Literature
A key contribution of the paper is to demonstrate that self-play can be employed at scale for training strong LLM theorem provers. This contribution is important in the context of theorem proving, and also presents potential avenues for the broader topic of reasoning with LLMs.
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- The paper is generally quite well written and easy to follow. There are also many details about the approach which make it easy to understand.
- The overall approach is also relatively straightforward, which in my opinion is a strength since it makes the approach easier for others to build upon and scale.
Weaknesses:
- The authors do not include code to reproduce the results. I appreciate the details in the paper but releasing the code would be an important contribution (considering the strong results)
- As the authors discuss in the paper, the overall approach is fairly similar to Minimo (Poesia et al. 2024) but scaled up to Lean, with simpler conjecture generation. The novelty of the approach is thus somewhat limited. To be clear this does not detract from the paper (as indicated by my positive score) but I believe it is worth mentioning.
Other Comments or Suggestions
N/A
We thank reviewer dMfo for their positive review, and for noting “The experiment design overall is quite sound”. In the following, we address the reviewer’s questions/comments in detail.
The authors do not include code to reproduce the results. I appreciate the details in the paper but releasing the code would be an important contribution (considering the strong results)
We have fully released our code, data, and model weights after the submission deadline. We also provide an anonymized version at the following link: https://anonymous.4open.science/r/STP_rebuttal-85F9.
The DeepSeek-Prover baseline appears to be missing in Table 3.
The original DeepSeek-Prover paper does not report a number on PutnamBench. Below, we report the results of our own evaluation of the DeepSeek-Prover-V1.5-RL model, and we will include them in the paper upon revision.
| Method | Sample budget | Result |
|---|---|---|
| DeepSeek-Prover-V1.5-RL | 64 | 6/644 |
| DeepSeek-Prover-V1.5-RL | 3200 | 7/644 |
a bit more analysis of the conjectures would be useful. For instance, some details about what the distribution of topics looks like among the generated conjectures or how similar they are to the problems in LeanWorkbook.
Thank you for the comments. We can look at the distribution of lemmas used to prove the generated conjectures as a proxy for their topics. Here are the 20 most frequently used lemmas; the majority of the topics are algebra and inequalities:
sq_nonneg, mul_self_nonneg, div_le_div_iff, Real.sq_sqrt, div_le_iff, Real.sin_sq_add_cos_sq, pow_two, Real.sin_add, Real.cos_add, Real.sqrt_pos, div_nonneg, Real.cos_sub, Real.sin_sub, pow_pos, Real.cos_two_mul, Real.cos_sq, div_le_one, Real.sin_two_mul, pow_add, pow_mul
Among these 20 lemmas, 17 also rank among the top 20 most frequent lemmas in LeanWorkbook. This suggests that the generated conjectures are generally similar to the problems in LeanWorkbook.
The paper introduces the Self-play Theorem Prover (STP), a novel method for training large language models (LLMs) in formal theorem proving, addressing the scarcity of high-quality training data. STP employs an LLM in two roles: a conjecturer that generates new mathematical conjectures based on existing theorems and proofs, and a prover that attempts to prove these conjectures alongside statements from an existing dataset. The process is iterative, with the conjecturer trained on previously generated conjectures that are challenging yet provable by the current prover, enabling continuous improvement without additional external data. The method is evaluated on LeanWorkbook, miniF2F, ProofNet, and PutnamBench, achieving SoTA results across the board.
Questions for Authors
- What is the intuition behind the construction method of the conjecture dataset? Why can theorem Y be a good conjecture for theorem X?
- How does the final re-training affect evaluation performance? What if it is removed?
Claims and Evidence
The claims are well-supported by clear and convincing evidence. Table 1 compares STP with prior methods on miniF2F-test and ProofNet-test, showing higher pass rates and sampling efficiency compared to previous baselines and RL training. Figures 2 and 3 illustrate STP’s scaling advantage over expert iteration and Deepseek-Prover-V1.5 models.
One claim appears problematic: "STP proves 26.3% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration." I cannot find a reference for this reported 13.2%, and according to Figure 2, expert iteration appears to perform better than this stated number.
Methods and Evaluation Criteria
The STP method is well-suited for formal theorem proving. Its iterative conjecturing and proving process, inspired by mathematical practice, effectively addresses data scarcity through a self-adaptive curriculum. The evaluation criteria—pass rates on standard benchmarks such as LeanWorkbook, miniF2F-test, ProofNet-test, and PutnamBench—are appropriate and widely accepted in the field. Testing on both Lean and Isabelle verifiers demonstrates generalizability across formal languages.
Theoretical Claims
I reviewed the derivation in A.5 for the re-weighting method and it looks correct.
Experimental Design and Analysis
The experimental designs are sound and the comparisons are fair. The paper could be further improved through additional ablation studies on key design choices, for example, the filtering and re-weighting process for conjecturer training and the effectiveness of the final re-training phase.
Supplementary Material
I reviewed the implementation details and pseudo-code for the conjecturing dataset in A.5 and the additional results in Appendix B.
Relation to Existing Literature
The work is well situated within current trends in using LLMs for theorem proving. It also builds on previous studies on RL and self-play training.
Missing Important References
I can't notice any.
Other Strengths and Weaknesses
Strengths:
- The proposed method is novel and well-motivated;
- Empirical results are strong and comprehensive;
- The paper is well-structured, with intuitive explanations and illustrative figures.
Weaknesses:
- Ablation studies on the key components of the proposed method are missing.
Other Comments or Suggestions
Line 34 in abstract: versifiers → verifiers
We thank reviewer ubJp for their positive review, and for noting “The claims are well-supported by clear and convincing evidence”. In the following, we address the reviewer’s questions/comments in detail.
One claim appears problematic: "STP proves 26.3% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.2% achieved through expert iteration." I cannot find a reference for this reported 13.2%, and according to Figure 2, expert iteration appears to perform better than this stated number.
The number 13.2% is reported in [1] — they prove 13.1% of the statements in LeanWorkbook and disprove 3.9% (we realize we made a typo in the paper; the correct number should be 13.1%). We will add a reference in the paper upon revision.
Figure 2 shows the performance of expert iteration with the base model DeepSeek-Prover-V1.5-SFT, which is indeed better than 13.1%. However, in the original paper [2], they do not report any number on LeanWorkbook. Therefore, Figure 2 is based on our reproduction of [2].
[1] Wu, Zijian, et al. InternLM2.5-StepProver: Advancing Automated Theorem Proving via Expert Iteration on Large-Scale Lean Problems.
[2] Xin, Huajian, et al. DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search.
What is the intuition behind the construction method of the conjecture dataset? Why can theorem Y be a good conjecture for theorem X?
We construct the conjecture SFT dataset so that the generated conjecture (Theorem Y) is related to the given theorem (Theorem X) in the sense that their proofs use the same seed lemma. The conjecture dataset at the SFT stage only serves as an initialization of the model; the conjecturer gradually learns to generate diverse, relevant, and challenging yet approachable conjectures through STP's iterative training on the datasets constructed later (Section 3.2).
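For illustration, a simplified sketch of this pairing (not our exact pipeline; extract_lemmas and the record format are placeholders):

```python
from itertools import combinations

def build_conjecture_sft_examples(theorem_proofs, extract_lemmas):
    """theorem_proofs: dict mapping a theorem statement to its proof.
    Returns (seed lemma, given theorem X, target conjecture Y) examples where
    the proofs of X and Y share the seed lemma (record format is illustrative)."""
    examples = []
    for x, y in combinations(theorem_proofs, 2):
        shared = extract_lemmas(theorem_proofs[x]) & extract_lemmas(theorem_proofs[y])
        for lemma in shared:
            # Y plays the role of the conjecture the model should generate
            # when prompted with the seed lemma and theorem X.
            examples.append({"seed_lemma": lemma, "theorem_X": x, "conjecture_Y": y})
    return examples
```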
How does the final re-training affect evaluation performance? What if it is removed?
The following table compares STP and STP w/o retraining. In general, STP w/o retraining has a higher pass@1 but lower pass@k for large k. We will add this result in the paper upon revision.
| Method | Sample budget | miniF2F-test | ProofNet-test |
|---|---|---|---|
| STP w/o retraining | 1 | 45.1% | 7.8% |
| STP w/o retraining | 128 | 57.2% | 16.9% |
| STP w/o retraining | 3200 | 60.7% | 21.5% |
| STP | 1 | 41.1% | 5.9% |
| STP | 128 | 57.2% | 18.0% |
| STP | 3200 | 61.1% | 23.1% |
Thank you for the clarification. I maintain my positive assessment of the paper. Great work!
The paper presents a new theorem-proving system called the "Self-Play Theorem-Prover." The system consists of two modules -- the Conjecturer and the Prover -- that repeatedly interact with each other. The Conjecturer is trained on previously generated conjectures that were found hard to prove. The Prover tries to prove these conjectures using standard expert iteration. The system is shown to lead to SOTA performance on several established benchmarks.
The reviewers were all supportive of the paper. While the overall approach is similar to the existing Minimo system, this paper is the first to scale the ideas to Lean, and the problem it studies is one of the central open challenges in AI-for-maths. Given this, I am recommending acceptance. Please incorporate the reviewers' feedback carefully in the final version.