A Theory for Worst-Case vs. Average-Case Guarantees for LLMs
Introducing models that prove their own correctness via an Interactive Proof, and how to learn such models.
Abstract
Reviews and Discussion
This paper investigates the use of interactive proofs for learned models. The authors provide a general theory and show that incorporating rounds of verification yields soundness guarantees (i.e., rejecting wrong answers with high probability). This lays the conceptual groundwork for applying learned models to decision problems solvable in PSPACE. The authors propose two training algorithms and prove convergence results under certain assumptions.
Strengths and Weaknesses
Strengths: The background and introduction are strong and clear, providing a concrete example of the benefits of self-proving models. The intuition and informal explanations provided about the verifier were helpful in lines 67-68 and in lines 134-142.
Weaknesses: Remark 2.5 cites the union bound in line 128, but it is more accurate to just say that the events of a true positive (correct output) and false positive (soundness error) are disjoint. Many of the core claims rely on the existence of a verifier with strong properties (e.g., sufficiently low soundness error), but the paper does not explain how such a verifier would be constructed or learned. This weakens the practical relevance of the theoretical results.
Questions
Could you provide more guidance or examples on how a verifier might be instantiated in practice, even in a simplified setting? This would help clarify the practical applicability of the framework.
Have you considered demonstrating this approach on a constrained problem space where verifier construction is tractable, such as arithmetic or logic puzzles?
Limitations
While a few illustrative examples are mentioned (e.g., lines 134 and 238–244), the paper lacks any empirical results or concrete instantiation of the proposed framework, making it hard to assess its practical feasibility.
As noted above, the strongest limitation is the absence of a method for producing a suitable verifier — which is essential to the framework’s claims. Even a toy example or partial empirical demonstration would significantly strengthen the paper.
These limitations are not sufficiently acknowledged in the paper.
Final Justification
The paper remains somewhat abstract and speculative but might be considered promising.
Formatting Issues
N/A
We thank the reviewer for their constructive feedback and thoughtful suggestions. We address each point below:
Regarding Remark 2.5: Thank you. Indeed, the remark can also be formally proven in the way you are suggesting (by decomposing into the two disjoint events, rather than the union bound). We will update the wording in the remark.
Regarding verifier instantiation and empirical demonstration: We appreciate the reviewer's suggestion to demonstrate our approach in a simplified, arithmetic setting. We have added comprehensive experiments using GCD computation as our constrained problem space, with a verifier based on Bézout's theorem: for a claimed GCD d of integers x and y, the model provides coefficients a, b such that ax + by = d; the verifier then checks that d divides both x and y, and that ax + by = d (this confirms that d is the GCD, with certainty). This arithmetic setting provides the "simplified setting" the reviewer suggested while demonstrating a practical instantiation of our framework. The full experimental details are given below, and we will include them in the camera-ready version.
Experiment details: We have conducted experiments demonstrating that our framework can be practically instantiated. We trained 6.3M parameter transformers to compute and prove the correctness of GCD computations on pairs of integers up to 10^4, following Charton's [1] setup.
Key results:
- Transcript Learning (TL): Achieved 60.3% verifiability (correctly proving answers) after training for 100k steps.
- RLVF: Improved a base model with 40% verifiability to 79.3%, albeit in significantly more (4M) iterations; while this is not an experimental paper, we will continue to explore a practical speedup by searching the hyperparameter space.
- TL + Annotations: Reached 96% verifiability by training on proofs annotated with intermediate Euclidean algorithm steps (similar to chain-of-thought); importantly, evaluation uses only the final proof.
Our proof system relies on Bézout's theorem. Given inputs x and y, to claim that d = gcd(x, y), the model must output coefficients a, b such that ax + by = d. The verifier then checks that d divides both x and y, and that ax + by = d.
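For concreteness, here is a minimal Python sketch of the verifier's decision rule (the function and variable names are ours, chosen for illustration; this is not the code used in our experiments, which operate on token sequences):

```python
def verify_gcd_claim(x: int, y: int, d: int, a: int, b: int) -> bool:
    """Accept iff d is positive, divides both inputs, and satisfies Bezout's
    identity.  Any positive common divisor of x and y that can be written as
    a*x + b*y must be the greatest common divisor, so the two checks together
    certify d = gcd(x, y) with certainty."""
    if d <= 0:
        return False
    divides_both = (x % d == 0) and (y % d == 0)
    bezout_holds = (a * x + b * y == d)
    return divides_both and bezout_holds
```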
Ablations show that (1) more annotation steps improve verifiability monotonically, (2) even when trained with annotations, the model generalizes beyond the amount of annotations (i.e., it is not just memorizing the chain-of-thought, but exhibits length generalization), and (3) bases with more prime factors improve both correctness and verifiability—in line with Charton's findings.
[1] Francois Charton. Learning the greatest common divisor: explaining transformer predictions. ICLR 2024.
The paper proposes the idea of Self-Proving models, which interactively prove the correctness of their output to a verifier. This gives guarantees for a given input, in contrast to the typical accuracy guarantees averaged over a distribution. Two methods are proposed for learning such models: one via access to transcripts of verifications, and the other via RL through emulation of interaction with a verifier.
Strengths and Weaknesses
The paper introduces the new concept of Self-Proving models, proposes two methods for learning such models, and establishes convergence bounds.
One concern might be the level of significance/applicability of the work. Whilst theoretically solid, I feel the practical implications of this work would be limited. Consider, for example, the related area of models that prove general formal theorems: this is notoriously hard, and it seems unlikely that one could train a model with practical confidence parameters in any non-trivially applicable scenario. (The paper seems to push this question out of scope: "the rich and well studied question of which settings are verifiable within an interactive proof system is beyond our scope".)
In summary, I think this is an interesting framework that has been created, but I am left wondering if it can be instantiated in any meaningful way.
Strengths:
- well-written with sound and well-presented theorems/lemmas
- thorough and illuminating discussion of the motivation and concepts
- the ideas re backward data generation and the idea of bootstrapping RLVF with TL are good
Weaknesses:
- concerns re practicality/applicability of framework (per above)
Questions
- Perhaps a naive question, but how does this work sit with Gödel? E.g., the existence of true statements which are not provable would seem to present an issue for the claims of completeness in practice.
- In Definition 2.2, is the word "honest" redundant? I.e., "there exists a P* s.t. ..." is surely sufficient.
Limitations
Yes
Final Justification
Though the proposed additional experiment is appreciated, I think my underlying feeling still stands that the scope of relevance of this framework is typically quite limited. I will keep my score.
Formatting Issues
N/A
Thank you for your thoughtful review and for recognizing our framework as interesting and theoretically sound. We appreciate your acknowledgment of our well-presented theorems, thorough motivation, and novel ideas. Your questions about practical applicability and foundational limits are insightful and help clarify important aspects of our work.
Regarding concerns
"Whilst theoretically solid, the practical implications of this work I feel would be limited. For example, considering the related area of models that prove general formal theorems. This is notoriously hard, and it seems unlikely that one could train a model with practical confidence parameters in any non-trivially applicable scenario."
We share your view that general theorem proving is notoriously difficult. However, our framework applies to a much broader class of problems than just "trivial" mathematics. Many computational tasks admit efficient interactive proofs while remaining tractable—from arithmetic operations to graph properties to optimization problems. To demonstrate practical instantiation, we have conducted experiments on GCD computation with Bézout coefficient proofs. Our 6.3M parameter transformers achieved 60.3% verifiability with basic Transcript Learning, 79.3% with RLVF, and 96% with annotated training. While GCD is admittedly simpler than general theorem proving, it demonstrates that self-proving models can be trained effectively for non-trivial mathematical tasks. This provides a foundation for exploring more complex applications in future work. We will include analysis (with figures) in the camera-ready version and release the model and code publicly. We note that the code is generic, so that larger transformers can be trained for more involved proof systems by those who have the computational resources.
Regarding questions
Gödel's incompleteness: Not at all naive, this question is an important one to ask in the context of any Proof system! Informally speaking, you are right to say that there is no single Interactive Proof system that is complete with respect to “all True statements.” Instead, one should think about specific problems (i.e., specific notions of True statements) for which there is an IP system. Then, our theory studies whether it is possible to learn an approximately-honest Prover. Indeed, there are problems for which an IP system cannot exist; but on the other hand, there is an extremely rich family of problems (anything solvable in polynomial space) for which an IP system does exist (as shown by Shamir 1992).
"Honest" terminology: Thank you for pointing out this potential confusion. We've clarified the definition to make it clearer that we're defining what "honest" means. We now write: "If there exists a prover P* such that [conditions]... Such a prover is called an honest prover." This should make it evident that "honest" is the term being defined, not a redundant qualifier.
Experiment details: We have conducted experiments demonstrating that our framework can be practically instantiated. We trained 6.3M parameter transformers to compute and prove the correctness of GCD computations on pairs of integers up to 10^4, following Charton's [1] setup.
Key results:
- Transcript Learning (TL): Achieved 60.3% verifiability (correctly proving answers) after training for 100k steps.
- RLVF: Improved a base model with 40% verifiability to 79.3%, albeit in significantly more (4M) iterations; while this is not an experimental paper, we will continue to explore a practical speedup by searching the hyperparameter space.
- TL + Annotations: Reached 96% verifiability by training on proofs annotated with intermediate Euclidean algorithm steps (similar to chain-of-thought); importantly, evaluation uses only the final proof.
Our proof system uses Bézout's theorem: the model must output coefficients a, b such that ax + by = d, where d is the claimed GCD of the inputs x and y, which the verifier can use to easily check (with certainty) that the model's output is indeed the GCD.
More nuanced results show (1) more annotation steps improve verifiability monotonically, (2) even when trained with annotations, the model generalizes beyond the amount of annotations (i.e., it is not just memorizing the chain-of-thought, but exhibits length generalization), and (3) bases with more prime factors improve both correctness and verifiability—in line with Charton's findings.
While GCD is simpler than general theorem proving, it provides a concrete proof-of-concept that self-proving models can be trained in practice. We are happy to provide full experimental details and will include them in the camera-ready version.
[1] Francois Charton. Learning the greatest common divisor: explaining transformer predictions. ICLR 2024.
Thank you for your responses. The description of the experimental results is appreciated. Just to clarify, you say that the model is trained to output a and b as well as the putative GCD=c, say. Of course, a, b and c satisfying Bezout's identity is necessary though not sufficient for c to actually be the GCD. Yet you claim this can serve as the verification algorithm?
Though the experiment (modulo the clarification above) is appreciated, I think my underlying feeling still stands that the scope of relevance of this framework is typically quite limited.
The GCD c is the only common divisor of x and y that satisfies Bézout's identity (any positive common divisor c with ax + by = c must equal gcd(x, y)). So, for inputs x and y, the prover sends (c, a, b) as you correctly said. The verifier then checks that c divides both x and y, and that Bézout's identity ax + by = c is satisfied.
Thank you for requesting this clarification; this important detail will be made clear in the full write-up.
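For completeness, here is a minimal sketch (ours, for illustration) of how an honest prover can compute such a certificate via the extended Euclidean algorithm; the intermediate quotients and remainders it computes are the kind of steps recorded in the annotated transcripts mentioned above:

```python
def honest_prover_gcd(x: int, y: int):
    """Extended Euclidean algorithm: returns (c, a, b) with
    c = gcd(x, y) and a*x + b*y = c."""
    old_r, r = x, y
    old_a, a = 1, 0
    old_b, b = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_a, a = a, old_a - q * a
        old_b, b = b, old_b - q * b
    return old_r, old_a, old_b  # (c, a, b)
```

For example, honest_prover_gcd(12, 8) returns (4, 1, -1), and indeed 1*12 + (-1)*8 = 4.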
Thanks. Yes, it needed that extra condition re being a divisor.
This paper introduces the concept of self-proving models, a class of sequence-to-sequence models designed to interact with a specified verifier via an interactive proof protocol. Two learning algorithms are proposed for training such models:
- Transcript Learning Algorithm: This method trains the model using a ground-truth dataset of interactive proofs generated by an honest prover interacting with the verifier.
- RLVF Algorithm: This approach employs reinforcement learning, where the training data is generated by the model itself and validated by the verifier.
Under certain assumptions, the authors prove the convergence of both algorithms.
Strengths and Weaknesses
Strengths:
- The paper addresses the interesting challenge of worst-case guarantees for probabilistic models.
Weaknesses:
- The setting proposed is overly abstract. The self-proving model is defined in terms of an unspecified probabilistic model and an abstract verifier, without any structural constraints. Given that the general interactive proof setting captures problems as hard as PSPACE-complete (i.e., all of IP), the current abstraction leaves little room for meaningful theoretical insight or practical application.
- The proposed learning algorithms are relatively straightforward, and the lack of empirical evaluation significantly reduces their practical interest.
- The convergence guarantee in Theorem 4.1 relies on strong and arguably unrealistic assumptions, including parameter convexity and Lipschitzness, which limits the practical relevance of the result.
- The technical novelty is limited. Key aspects of the convergence proof rely heavily on prior results by Shalev-Shwartz and Ben-David [2014], with little original contribution.
Questions
N/A
Limitations
- The high level of abstraction in the problem formulation restricts both the theoretical implications and the potential for real-world applications.
Final Justification
The authors present an empirical study on generating proofs for GCD. While the work is interesting, I see a fundamental limitation in the proposed architecture: the model is trained to produce the final answer before generating the proof. Although this ordering resembles the definition of an interactive proof system, it effectively removes the chain-of-thought reasoning process from answer generation. As a result, the model’s reasoning capacity is constrained to AC0 [a], i.e., problems solvable by constant-depth circuits.
Given that reversing the generation order can significantly enhance the model’s problem-solving ability, I believe the formulation in the paper should be substantially revised to address this architectural limitation. I will therefore keep my score unchanged.
[a] Li, Zhiyuan, et al. "Chain of thought empowers transformers to solve inherently serial problems." arXiv preprint arXiv:2402.12875 (2024).
Formatting Issues
N/A
Thank you for your review and for acknowledging that our paper addresses an interesting challenge regarding worst-case guarantees of probabilistic models. We appreciate your detailed feedback and address each of your concerns below.
Regarding abstraction and PSPACE-completeness: As the reviewer notes, IPs can be used to prove PSPACE-complete problems. However, this does not preclude designing very efficient IPs for tractable computational problems or classes of computations. Moreover, even if a problem is intractable in the worst case, this does not preclude constructing efficient honest provers that perform well on real-world inputs (i.e. convince the verifier on average-case inputs). In terms of theoretical insights, the literature on IPs has produced a treasure trove of such insights (e.g. the literature on ZK proofs and doubly-efficient proof systems), even though the same objection (that IPs include PSPACE complete problems) could be raised there. This is the case even though the focus there is on worst-case guarantees (for the honest prover), let alone in the ML setting where we focus on average-case guarantees (for the honest prover).
Regarding empirical evaluation: We have conducted comprehensive experiments demonstrating practical instantiation. We trained 6.3M parameter transformers to compute and prove GCD correctness using Bézout certificates. Results: Transcript Learning achieved 60.3% verifiability, RLVF improved models from 40% to 79.3%, and Annotated TL reached 96% verifiability. Crucially, models generalize beyond their training—proving GCDs that require more steps than seen during training. These experiments validate that our algorithms produce non-trivial learning behavior in practice. We provide additional experimental details following the rebuttal.
Theorem 4.1 relies on strong and arguably unrealistic assumptions.
While autoregressive neural networks have non-convex loss landscapes, analyzing convex settings remains valuable in ML theory. Convex analysis provides clean mathematical tools for building intuition and establishing foundational results—a standard approach in theoretical ML. Proving convergence without convexity is the subject of cutting-edge research. Following common practice, we prove results under simplifying assumptions and complement them with experiments showing convergence even when these assumptions don't hold. We note that the same algorithm for transcript learning is used in the experiments as in the theorem. Our experiments confirm practical convergence of transcript learning on non-convex transformer architectures.
The technical novelty is limited. Key aspects of the convergence proof rely heavily on prior results by Shalev-Shwartz and Ben-David [2014], with little original contribution.
The technical novelty of this paper is in contributing a formal model for verifying correctness on every instance, connecting a rich body of literature on interactive proofs to the area of generative models (as well as in the proofs of the convergence theorems for Transcript Learning). Many impactful theoretical works were first introduced in "definitional" papers; with a robust definition at hand, follow-up works can yield general theorems. A case in point is the original work on Zero-Knowledge Interactive Proofs: the first paper introduced the model and gave one example of a zero-knowledge proof for one specific problem (distinguishing quadratic residues from non-residues), and shortly after, follow-up work showed how to obtain zero-knowledge proofs for all of NP. Furthermore, using theoretical notions such as debate systems has significantly improved the accuracy of generative models; similarly, our Self-Proving models could potentially lead to improved accuracy guarantees as well.
Experiment details: We have conducted experiments demonstrating that our framework can be practically instantiated. We trained 6.3M parameter transformers to compute and prove the correctness of GCD computations on pairs of integers up to 10^4, following Charton's [1] setup.
Key results:
- Transcript Learning (TL): Achieved 60.3% verifiability (correctly proving answers) after training for 100k steps.
- RLVF: Improved a base model with 40% verifiability to 79.3%, albeit in significantly more (4M) iterations; while this is not an experimental paper, we will continue to explore a practical speedup by searching the hyperparameter space.
- TL + Annotations: Reached 96% verifiability by training on proofs annotated with intermediate Euclidean algorithm steps (similar to chain-of-thought); importantly, evaluation uses only the final proof.
Our proof system uses Bézout's theorem: the model must output coefficients a, b such that ax + by = d, where d is the claimed GCD of the inputs x and y, which the verifier can use to easily check (with certainty) that the model's output is indeed the GCD.
More nuanced results show (1) more annotation steps improve verifiability monotonically, (2) even when trained with annotations, the model generalizes beyond the amount of annotations (i.e., it is not just memorizing the chain-of-thought, but exhibits length generalization), and (3) bases with more prime factors improve both correctness and verifiability—in line with Charton's findings.
While GCD is simpler than general theorem proving, it provides a concrete proof-of-concept that self-proving models can be trained in practice. We are happy to provide full experimental details and will include them in the camera-ready version.
[1] Francois Charton. Learning the greatest common divisor: explaining transformer predictions. ICLR 2024.
In the setting of this paper, the prover (i.e., the self-proving model) is required to output the final answer before the first round of the verification process (as illustrated in Figure 1). Given this constraint, how can chain-of-thought reasoning be meaningfully incorporated, especially since such reasoning typically unfolds during the generation of the answer rather than before it? Specifically, if the final answer must be committed in advance, how does the model benefit from or make use of intermediate reasoning steps?
We thank the reviewer for this important question. You're right to note that in our experiments, we apply chain-of-thought to the proof generation, not to generating the initial answer (the claimed GCD d). This might seem counterintuitive since d is already committed before the proof begins.
The key point is that in our paper the model must not only generate the correct answer d, but also generate a valid proof for it. Crucially, in our setting, computing the proof is (empirically) much harder than computing the answer itself: following Charton, our small transformer generates the correct answer >98% of the time, but generates the correct Bézout coefficients only ~60% of the time after 100k steps of Transcript Learning. Concretely, in our GCD experiments:
- The model first commits to an answer, the claimed GCD d.
- During the interactive proof, the model must generate Bézout coefficients a, b that certify d = gcd(x, y).
- Chain-of-thought helps by letting the model work through intermediate steps of the extended Euclidean algorithm before committing to the final Bézout coefficients.
- These intermediate computations are empirically helpful for generating valid proofs (~60% success without CoT vs. 96% with CoT).
To clarify the mechanism more generally: when generating a proof step, the model actually produces an annotated version of that step which includes intermediate reasoning. An extractor then strips away the reasoning to obtain just the final proof step that is sent to the verifier (this extractor is a very simple function, such as trimming the first few tokens). The model benefits from this reasoning because it was trained on annotated transcripts that include these intermediate steps, but the annotations are never sent to the verifier during the interaction.
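As an informal illustration (the separator and the string format below are hypothetical choices, not the exact tokenization used in our experiments), the extractor can be as simple as:

```python
# Hypothetical annotation format: reasoning tokens, a separator, then the
# proof step itself; only the proof step is forwarded to the verifier.
ANNOTATION_SEP = "#"

def extract_proof_step(annotated_step: str) -> str:
    """Strip the chain-of-thought annotation; return only the text after
    the separator, which is what gets sent to the verifier."""
    return annotated_step.split(ANNOTATION_SEP)[-1].strip()

annotated = "q=1 r=4 ; q=2 r=0 # a=1 b=-1"   # Euclidean steps, then Bezout coefficients
assert extract_proof_step(annotated) == "a=1 b=-1"
```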
We are more than happy to follow-up with a more formal description of how annotations work if that would help!
The authors present an empirical study on generating proofs for GCD. While the work is interesting, I see a fundamental limitation in the proposed architecture: the model is trained to produce the final answer before generating the proof. Although this ordering resembles the definition of an interactive proof system, it effectively removes the chain-of-thought reasoning process from answer generation. As a result, the model’s reasoning capacity is constrained to AC0 [a], i.e., problems solvable by constant-depth circuits.
Given that reversing the generation order can significantly enhance the model’s problem-solving ability, I believe the formulation in the paper should be substantially revised to address this architectural limitation. I will therefore keep my score unchanged.
[a] Li, Zhiyuan, et al. "Chain of thought empowers transformers to solve inherently serial problems." arXiv preprint arXiv:2402.12875 (2024).
We appreciate the reviewer's continued engagement with our work. However, this latest comment appears to reflect a misunderstanding of our contribution. We are happy to clarify, though we note that this important point was raised very late in the discussion period, limiting the opportunity for deeper exchange.
Our paper proposes a definitional framework, not a neural architecture. We define (in Def. 2.4) how any sequence-to-sequence model (transformers, RNNs, future architectures) can interact with a trusted verifier to prove correctness. The Li et al. paper about AC⁰ limitations of transformers addresses a different topic: it analyzes specific architectural constraints, while we propose a model-agnostic verification framework.
Regarding the AC⁰ argument. In interactive proof systems, limits on the verifier's computational class do not impose any limits on the prover; the two are conceptually independent. Our framework does not constrain the prover's computational complexity—that depends on the particular problem for which proofs are generated, the input distribution (hence “Average Case”), the optimization landscape (see Theorem 4.1), and so on. Using a CoT-enabled transformer within our framework was simply a design choice in our follow-up experiments, and is unrelated to our main contributions.
The answer-first ordering is a syntactic choice. One could just as well output the answer at the end rather than at the beginning. This is a presentational choice aligning with standard interactive proof definitions, not a limitation.
We hope this clarification helps distinguish our contribution---a theoretical framework connecting interactive proofs to machine learning, enabling worst-case correctness guarantees for average-case ML models---from the architectural constraints considered in the Transformer Theory literature mentioned by the reviewer.
This work focuses on the correctness of LLMs on given inputs. To do so, it proposes an approach via interactive proofs, named Self-Proving models. In this model, LLMs generate an answer together with a proof, and users are able to verify the proof and be convinced of the answer's correctness. It also proposes Transcript Learning and Reinforcement Learning from Verifier Feedback to learn SPMs.
Strengths and Weaknesses
Strength:
- This work carefully defines notions of correctness and soundness for LLMs with interactive proofs, as well as verifiability.
- Beyond the worst-case scenario, this work also discusses the average-case scenario.
- This work provides algorithms to learn SPMs, with a convergence guarantee for Transcript Learning.
Weakness:
- This work lacks empirical results, even though it focuses mainly on theory. Experimental results would be more convincing, even simple ones.
- This work defers too many details to the appendix, such as the related work and the proofs of lemmas.
Questions
- Have you considered extending this work to support ZKPs? ZKPs provide the guarantees of IPs as well as the zero-knowledge property, making the system secure even if the prover (the LLM) is untrusted.
- What is the convergence guarantee for RLVF? I know you mentioned this in Remark 4.2. Can you provide more explanation (you may provide this in the appendix if space is tight)?
Limitations
N/A
Final Justification
Since the authors will add empirical results for the work, I will keep my score.
Formatting Issues
N/A
We sincerely thank you for your positive and constructive review. We appreciate your recognition of our careful definitions of correctness and soundness, our treatment of both worst-case and average-case scenarios, and our convergence analysis for transcript learning. Your feedback has helped us significantly improve the paper.
Regarding concerns
Empirical results: We will add experiments demonstrating the practical viability of our framework. We trained 6.3M parameter transformers to compute and prove the correctness of GCD computations. Key results include: Transcript Learning achieving 60.3% verifiability, RLVF improving a model from 40% to 79.3% verifiability, and Annotated TL reaching 96% verifiability. Our ablation studies reveal that models generalize beyond their training annotations and that representation choices significantly impact performance. Appended to this rebuttal are additional details. We will include analysis (with figures) in the camera-ready version and release the model and code publicly.
Paper organization: We will move the related works into the body of the paper. We follow the conventions of theory papers wherein theorems are formally stated in the body; proof overviews are given in the body, and the full details are deferred to the appendix.
Regarding questions
ZKP extension: That is an extremely interesting direction for future research. We note that already the current setup (IPs) yields a secure system even when the LLM-prover is untrusted (this is the soundness property). ZK would additionally guarantee that the LLM-prover does not leak anything (e.g., sensitive training data) to the verifier. Thank you for encouraging this direction.
RLVF convergence: Unlike supervised learning, there is no "standard" set of assumptions for RL convergence theorems. Rather than arbitrarily choosing specific assumptions, we prove a Gradient Approximation Lemma from which convergence guarantees can be derived in particular settings. This mirrors the approach for, e.g., Policy Gradient methods, where the Policy Gradient Theorem (which is a gradient estimation lemma) provides the foundation from which specific algorithms derive their guarantees. See, for example, Chapter 3 of Reinforcement Learning: Theory and Algorithms by Agarwal, Jiang, Kakade and Sun (2022).
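For intuition, here is a schematic REINFORCE-style update that uses verifier acceptance as a binary reward. It conveys the policy-gradient flavor of RLVF but is not the algorithm as specified in the paper; `sample_transcript` and `verifier` are hypothetical interfaces:

```python
import torch

def rlvf_style_step(model, optimizer, x, verifier, sample_transcript):
    """Schematic only: sample an interaction, set reward = verifier acceptance,
    and take a REINFORCE-style gradient step on the model's parameters."""
    transcript, log_prob = sample_transcript(model, x)  # log-prob of the model's messages (a torch scalar)
    reward = 1.0 if verifier.accepts(x, transcript) else 0.0
    loss = -reward * log_prob                           # REINFORCE gradient estimator
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```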
Experiment details: We have conducted experiments demonstrating that our framework can be practically instantiated. We trained 6.3M parameter transformers to compute and prove the correctness of GCD computations on pairs of integers up to 10^4, following Charton's [1] setup.
Key results:
- Transcript Learning (TL): Achieved 60.3% verifiability (correctly proving answers) after training for 100k steps.
- RLVF: Improved a base model with 40% verifiability to 79.3%, albeit in significantly more (4M) iterations; while this is not an experimental paper, we will continue to explore a practical speedup by searching the hyperparameter space.
- TL + Annotations: Reached 96% verifiability by training on proofs annotated with intermediate Euclidean algorithm steps (similar to chain-of-thought); importantly, evaluation uses only the final proof.
Our proof system uses Bézout's theorem: the model must output coefficients a, b such that ax + by = d, where d is the claimed GCD of the inputs x and y, which the verifier can use to easily check (with certainty) that the model's output is indeed the GCD.
More nuanced results show (1) more annotation steps improve verifiability monotonically, (2) even when trained with annotations, the model generalizes beyond the amount of annotations (i.e., it is not just memorizing the chain-of-thought, but exhibits length generalization), and (3) bases with more prime factors improve both correctness and verifiability—in line with Charton's findings.
While GCD is simpler than general theorem proving, it provides a concrete proof-of-concept that self-proving models can be trained in practice. We are happy to provide full experimental details and will include them in the camera-ready version.
[1] Francois Charton. Learning the greatest common divisor: explaining transformer predictions. ICLR 2024.
I think the empirical results are good, and thank you for including them in the final submission. Also, you clarified the RLVF convergence issue.
We thank all reviewers for their constructive engagement, which helped improve the paper's content and presentation. Discussions were carried out in two threads:
1. Theoretical Clarifications:
- Scope and completeness (FMTQ, AHTf): We clarified that while IPs can express PSPACE-complete problems, this doesn't preclude efficient proofs for problems that are tractable on average. Our framework focuses on average-case performance (where ML excels) rather than worst-case guarantees. Regarding Gödel's incompleteness (AHTf), we explained that our framework applies to specific problems with IP systems (anything in PSPACE, per Shamir 1992).
- Verifier construction (p6mv, AHTf): We gave a concrete, simple demonstration of verifiers using GCD with Bézout certificates.
- RLVF convergence (KBcF): We explained our gradient approximation approach, which follows standard RL methodology (e.g., Policy Gradient Theorem) rather than imposing arbitrary assumptions.
- Chain-of-thought integration (FMTQ): We clarified how CoT enhances proof generation.
2. Empirical validation: Following suggestions for experiments in simplified/arithmetic settings, we added an empirical evaluation on proving correctness of GCDs (following Charton, ICLR 2024 spotlight). We found:
- Transcript Learning: 60.3% verifiability
- RLVF: 40% → 79.3% improvement
- TL with annotations (CoT): 96% verifiability
- Models length-generalize beyond training data
These experiments provide empirical validation, a concrete example, and a modular codebase for training Self-Proving models. We will add the full results to the camera-ready, and release the codebase, data, and models.
Review outcomes: KBcF and AHTf confirmed satisfaction with our responses. p6mv initially rated 5 (accept) but has not yet engaged with our rebuttal. FMTQ maintains a reject, raising last-minute concerns about the circuit complexity of transformers; this seems to stem from a misunderstanding, as our theory makes no architectural assumptions about the prover (in fact, the word "transformer" appears only once in the body of our paper). We attempted to clarify this as soon as we saw this final comment, and believe that had this comment been made earlier we could have reached a mutual understanding.
This paper proposes a framework of designing (LLM) models that can argue about the correctness of their decision to convince a (weak) verifier. The paper proposes a definitional framework a la interactive proofs. It then proposes two methods for training such models: one is based on “transcript learning” and the other based on “reinforcement learning” (designed for interactive proofs).
The paper does not have empirical experiments, which was brought up by the majority of the reviewers and was considered a major weakness. During the rebuttal, the authors report that they can implement the framework and obtain initial results for the (simple) problem of GCD computation. Different reviewers valued this (potential) contribution differently; some considered it enough to make the paper acceptable and some still saw it as a toy example. The paper's methods also build upon previous works, and hence its main contributions are definitional.
In the end, the problem tackled by the paper is very timely and relevant to how LLM reasoning is playing out in practice. However, the practical side was questioned by the reviewers, and the paper would definitely get a boost from adding such experimental results. Overall, the paper is borderline, but due to its theoretical merit it could be a nice addition to the program if there is room.