PaperHub
Overall score: 6.8/10 · Decision: Poster · 4 reviewers
Reviewer ratings: 4, 4, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.0 · Novelty: 2.8 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

Logic.py: Bridging the Gap between LLMs and Constraint Solvers

OpenReview · PDF
Submitted: 2025-04-28 · Updated: 2025-10-29
TL;DR

Using Formal Reasoning Tools in RL and at Inference Time to improve Reasoning Capability

Abstract

Keywords
formal reasoning · machine learning · large language models · constraint solvers

Reviews and Discussion

Review
Rating: 4

This paper introduces logic.py, a domain-specific language designed to bridge the gap between large language models (LLMs) and constraint solvers. Specifically, LLMs are prompted to formalize problems into logic.py, which are then solved by constraint solvers. The introduction of this intermediary language marks a significant advancement, as demonstrated by benchmarks on ZebraLogicBench.

Strengths and Weaknesses

Strengths

  • Exploring the necessity of a domain-specific language (DSL) can be quite intriguing, as it offers a new perspective for bridging large language models (LLMs) and constraint solvers.
  • The summarized criteria for defining a DSL are well-founded and can inspire future research.
  • The empirical results seem to show potential when compared to the baselines.

Weaknesses

  • Sections 1 and 2 of the paper lack a clear structure, making it difficult to follow the subsections and connect them to the problem being addressed.
  • The current framework does not adequately resolve error feedback. Additionally, further exploration is needed on how to interact with the model to fix these errors.
  • The evaluation is insufficient. Including more benchmarks and more baselines (e.g., [1, 2]) would be beneficial for better illustrating the performance of the proposed method.

[1] Faith and fate: Limits of transformers on compositionality. Nouha Dziri et al. NeurIPS 2023.

[2] Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter? Nemika Tyagi et al. EMNLP 2024.

Questions

  • The existing LLMs could be familiar with the Z3 API, z3py. What is the performance like when using z3py as a "DSL"?
  • More details regarding the evaluation are needed. For instance, what is the number of generations for each problem?
  • Have the authors analyzed the errors of the proposed framework on the benchmark? Specifically, how many problems failed due to syntax issues, and how many were due to incorrect formalization?

Limitations

Yes

Justification for Final Rating

My rating for this paper is a borderline accept. Although the paper successfully proves the effectiveness of symbolic system integration for logic problems, it would be significantly strengthened by a more comprehensive comparison with prior work.

Formatting Concerns

N/A

Author Response
  • Z3py is a Python wrapper that maps closely to SMT2 statements, but provides operator overloads that allow expressing certain constraints in a more Pythonic fashion. However, the formulas created are fundamentally still expressed over Z3’s domain, meaning that important Python features that Logic.py provides, like classes or even just repeated assignments to the same variable, are unavailable to the model. We would thus expect that without significant extensions in this regard, z3py as a DSL would not match Logic.py’s performance. (A short z3py sketch follows this list for illustration.)
  • We limit the number of generations to at most 5 per problem. We collected data about restarts and observed that most problems can be solved in 1-3 attempts.
  • In our ZebraLogicBench benchmark run of 1000 samples, we observe 51 occurrences of unsatisfiable constraints across all attempts and tasks, and 70 occurrences of syntax errors in Logic.py. Of the 9.3% of puzzles where our approach did not produce an (entirely) correct solution, in 2.9% of cases we were not able to produce a solution at all in five attempts, due to the aforementioned syntax or unsatisfiability errors. In the remaining 6.4% of cases, Polymath produced a valid, but incorrect constraint and thus produced an incorrect solution.
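
To make the z3py comparison in the first bullet above concrete, here is a minimal illustrative sketch (ours, not from the paper) of a toy constraint (find an even integer x with 8 <= x < 10) expressed directly in z3py. Note that the constraints live entirely in Z3's term domain; there is no notion of Python classes or reassignment as in Logic.py:

from z3 import Int, Solver, And, sat

x = Int("x")                                  # a Z3 integer term, not a Python int
solver = Solver()
solver.add(And(x >= 8, x < 10, x % 2 == 0))   # constraints are Z3 expressions
if solver.check() == sat:
    print(solver.model()[x])                  # prints 8
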
Comment

Thank you for your response; it has addressed all of my questions. I've updated my score to 4. However, I still recommend that more comparison and ablation experiments be conducted, as this would significantly strengthen the results.

Review
Rating: 4

This paper makes an attempt to address the lack of reasoning exhibited by large language models (LLMs) in the context of a particular kind of logic task: given a list of facts about K objects and their properties, the model must determine which of the K objects matches which of the facts and properties. For this task, the authors develop a Python-like domain-specific language (DSL), Logic.py, that describes the problem, which an LLM is supposed to produce given the input problem description in natural language. The Logic.py formulation is then translated either into C, which is handed to the CBMC solver, or into a satisfiability modulo theories (SMT) formula to feed a generic SMT solver. Finally, either CBMC or the SMT solver produces a solution to the problem. This approach may require multiple iterations if the LLM fails to produce correct Logic.py code. Experimental results obtained on the widely used benchmark set ZebraLogicBench (containing exactly the above kind of logic puzzles) demonstrate a significant performance boost for a few LLMs, including Llama 3.1 70B, GPT-4o, and Claude 3.5 Sonnet, when compared to the baseline LLMs without the Logic.py functionality.

Strengths and Weaknesses

I would say the main strength of the paper is that it is well written. The motivation is reasonable and the flow is quite easy to follow. Another upside is the experimental results demonstrating a significant performance boost for the target class of logic tasks. The approach tackles the problem of the lack of reasoning in LLMs in an effective way by delegating the entire reasoning task to a dedicated solver, which is known to be effective for the considered class of problems.

Among the weaknesses is the simplicity of the proposed approach and its somewhat "hard-coded" nature. Indeed, the language Logic.py in its current form is very specific to the considered class of problems. Note that both CBMC and SMT solvers are known to be effective for such simple reasoning problems. In fact, this matching problem with a bunch of "all-different" constraints is trivial for them unless the size of the problem is huge. So what is left is a simple translation of the problem into a DSL compilable to C or SMT, which I guess can be called the main technical contribution of the paper that makes the approach work.

What is somewhat disappointing is that the authors do not address the obvious issue of potential errors occurring here and there in the translation process. It is clear that no LLM is capable of producing correct Logic.py code all the time, and what the authors do in case of a failure is rerun the LLM from scratch until it produces correct code.

Another limitation is that the approach not only targets one particular type of problem, but also cannot handle that same type of problem when it has no solution. This is because in the case of unsatisfiability reported by either CBMC or the SMT solver, the authors again restart the process from scratch, which is not very helpful if there is no solution at all.

Some more-or-less minor issues include the lack of benchmark description in terms of size and complexity, the lack of discussion on the scalability of the approach (in terms of the number of iterations taken from scratch till a solution gets produced), and a large number of typos in the text.

Questions

  1. How can you generalise your approach to other types of logic puzzles? I am mostly interested in a general approach that would be able to handle a wide class of logic problems with a large variety of constraints.

  2. What can you say about the number of iterations your approach takes for the considered benchmarks (min, max, avg) if an LLM fails to produce reasonable Logic.py code?

  3. How do you plan to address the issue of unsatisfiability, i.e., if the problem has no solution? How do you plan to explain it to the user?

Limitations

Yes

Justification for Final Rating

While some of my concerns have been addressed by the authors in their rebuttal and the follow-up discussion, the major limitation of the approach in terms of its inability to handle problems with no solutions should be explicitly discussed in the paper and will perhaps require further significant efforts.

Formatting Concerns

N/A

Author Response
  1. The reviewer's observation regarding the inherent limitations of relying on formal reasoning engines is accurate. Firstly, our approach is constrained by the problem domain of these solvers - for example, SAT solvers operate within NP-complete problems. Secondly, even within this domain, the complexity and size of the constraints necessary for a puzzle or game may well exceed what the model can realistically express in Logic.py.

    Nevertheless, the SAT domain supported by CBMC encompasses a wide range of games and puzzles. And importantly, our method is not tailored specifically to Zebra Logic puzzles. We do not provide system prompts unique to Zebra puzzles; instead, we instruct the model to generate a generic "data structure" capable of representing the solution to a given logic puzzle. This data structure may include entities in search puzzles like Zebra Logic, or steps and moves in game-based puzzles. Similarly, the model is only tasked to express the "constraints or rules" that define a correct solution to the puzzle or game.

    We selected FOLIO as a second benchmark primarily to demonstrate that our engine can scale to different types of solvers as well. In particular, quantifier alternations could not be easily mapped to CBMC’s SAT domain. However, FOLIO also serves as an example of how this “data structure and constraint”-based approach can generalize to a substantially different type of puzzle.

    Regarding scalability to new solvers and domains, our DSL acts as an intermediate protocol that facilitates the integration of new reasoning engines into the agent. For instance, an internal team is already employing a Polymath-like agent for a real-world support use case, leveraging a solver optimized for process flow constraints.

    Of course, decision problems beyond NP remain challenging in both formal and informal contexts. But overall, these points highlight the flexibility and extensibility of our approach beyond any single puzzle type or solver domain.
  2. In our experiments we limit Polymath to at most 5 restarts per problem. We observe that the vast majority of problems in our benchmark set can be solved in 1-3 attempts. Most easy and medium puzzles are solved in a single attempt, and in 2.9% of cases we cannot produce a syntactically valid formalisation even after 5 attempts.
  3. Unsatisfiability (no valid solutions according to the model’s constraints) and ambiguity (multiple valid solutions according to the model’s constraints) are two signals that both our SAT and SMT engine implementations can provide for any puzzle. Both may be valuable information to relay back to the model for refinement, or ultimately as an answer to the user. The reviewer is correct in pointing out that we did not address leveraging UNSAT as a signal in this work, simply because it is not a valid outcome for the particular benchmarks we evaluated against.

    However, we did discuss in the paper how ambiguity feedback could be leveraged, and the strategy presented can be applied to unsatisfiability as well. Our current suggestion for ambiguity is to inform the model that the current constraints allow for multiple options, and let it decide in an agentic manner based on the input puzzle whether this is an acceptable or likely outcome. If yes, this is treated like any other result and reported to the user, otherwise it is grounds for a refinement / restart step.

    Fundamentally, we delegate to the model the responsibility of defining the problem’s domain via a Python data structure. For a production grade agent handling real-world use cases, we expect that developers may choose to either explicitly configure whether UNSAT is an option, or delegate this decision to the model as a part of the domain description process.
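
For additional context on the ambiguity signal described above: a standard way for an SMT back-end to detect that a formalisation admits multiple solutions is to obtain one model, add a blocking constraint that excludes it, and check satisfiability again. The following is a generic z3py sketch of that technique (ours, for illustration; not necessarily how the authors' engines implement it):

from z3 import Int, Solver, Or, sat

x = Int("x")
solver = Solver()
solver.add(x >= 1, x <= 3)             # toy constraints with several valid assignments
assert solver.check() == sat
model = solver.model()
# Block the first model: at least one declared constant must take a different value.
solver.add(Or([d() != model[d] for d in model.decls()]))
ambiguous = solver.check() == sat      # sat again means more than one valid solution
print(ambiguous)                       # True for this toy example
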
Comment

Thank you for providing answers to some of my comments.

  1. Unless I am missing something, and despite your claims, both of your benchmarks study the exact same problem, although the problem instances originate from two different sources. Namely, you solve the matching problem where the AI is supposed to correlate the object under question with the properties discussed in the puzzle formulation. So the apparatus you propose is still limited to that particular reasoning task and cannot do any other kind of reasoning. For example, using your approach, one cannot solve a sequence / order puzzle, one cannot solve a grid puzzle (say, Sudoku), one cannot do deduction / inference problems, one cannot solve pattern recognition puzzles, one cannot do spatial / visual reasoning, and so on and so forth. Please correct me if I am wrong.

  2. Thank you. I am quite-quite surprised to know that LLMs on average require only 1-3 attempts. In either case, this information must be present in the paper.

  3. Unsatisfiability is different from having multiple solutions. Unless I fail to understand the approach, there is no way to differentiate a nonsensical formulation of the puzzle generated by an LLM, which requires another attempt, from a perfectly valid formulation generated by an LLM for a problem with no solution. In both cases, a reasoner will report UNSAT. So I don't easily see how this can be taken into account in a simple feedback loop like yours.

Comment

Thank you for the effort in clarifying my earlier concerns. I respond to your comments below:

  1. I am glad to see that you can (I guess) hard-code Polymath to support these additional types of puzzles. In either case, the toolkit will obviously still be unable to handle any kind of reasoning problem, despite your reference to 3-SAT's NP-hardness. The reductions to SAT often aren't trivial and require mathematical thinking, which is far beyond the reach of automated solutions. Although I am happy to accept that you propose an initial solution, to be extended with additional capabilities in the future, the point is that your work does not propose a way to reduce any kind of problem expressible in natural language to a logical formalism to be tackled by a reasoning engine. Currently, it is a simple translator that can deal with a particular type of puzzle (multiple puzzles if you implement the above). So, as I see it, the toolkit will always be limited by a "list of problems" it can handle.

Also, related to this, the question is whether we should accept your work based on its current (automated) capabilities to support one kind of puzzle, or on its potential for further improvement (involving much more extra engineering). The answer isn't obvious to me.

  2. Thank you.
  3. I don't think I understand. If you have a reasoning problem that does not have a solution, no matter what you try to do in a feedback loop, it still does not have a solution. So, unless you have a mechanism to detect it once and for all, you will keep iterating: the LLM produces a formulation and the reasoning engine reports UNSAT. You need to be able to detect that the original problem (not its particular LLM reformulation) is unsatisfiable in order to stop the process.
Comment
  1. Perhaps there is some misunderstanding about the quick experiment mentioned in the previous comment. We did not need to "hard-code" anything to, for example, run a sequence puzzle against Polymath. We only needed to change the user input puzzle. Here is our experiment code:
puzzle: str = """This is an arithmetic sequence puzzle. Fill in the missing variables such that each line forms a sequence of numbers, where the arithmetic difference between two numbers in the same line is the same.
A, B, 21
A, 7, C, 15
D, C, E, 21
"""
# No benchmark eval, so any output format is fine.
output_format: str = "Output it in plain text for the user to read."

result_trace = ResultTrace("0")  # No benchmark, just use task ID 0
log_stream = StringIO()
with LoggerFactory(log_stream, True) as logger_factory:
    engine_strategy: EngineStrategy = CBMCSearchEngineStrategy(
        logger_factory, puzzle, output_format
    )
    async with create_chat_completion(
        logger_factory, "llama3-70b-instruct"
    ) as chat_completion:
        solver = LogicAgent(
            logger_factory, chat_completion, engine_strategy, result_trace
        )
        await solver.solve()

print(result_trace.solution)
# Output:
# A = 3
# B = 12
# C = 11
# D = 6
# E = 16

For ZebraLogicBench, the puzzle would just be a different input text, for example:

There are 5 houses, numbered 1 to 5 from left to right, as seen from across the street. Each house is occupied by a different person. Each house has a unique attribute for each of the following characteristics:
- Each person has a unique name: `Peter`, `Alice`, `Bob`, `Eric`, `Arnold`
- The people are of nationalities: `Norwegian`, `German`, `Dane`, `Brit`, `Swede`
- People have unique favorite book genres: `fantasy`, `biography`, `romance`, `mystery`, `science fiction`
- Everyone has something unique for lunch: `stir fry`, `grilled cheese`, `pizza`, `spaghetti`, `stew`
- Each person has a favorite color: `red`, `green`, `blue`, `yellow`, `white`
- The people keep unique animals: `bird`, `dog`, `cat`, `horse`, `fish`
## Clues:
1. The person who loves fantasy books is the Norwegian.
2. The cat lover and the person who loves biography books are next to each other.
3. The German is Bob.
...

Of course, for ZebraLogicBench we would use an output_format text instruction such that we can automatically evaluate it against the benchmark suite, i.e., some JSON format, but that is irrelevant for this example.

To resolve this confusion: at which point in the process illustrated in Fig. 5 would you expect the need to hard-code or change anything in order to handle, for example, the above sequence puzzle? As demonstrated, no such hard-coded change is necessary, and we will definitely commit this example as a unit test to our repository.

In summary, Polymath, as provided in the supplementary material, already supports many types of puzzles out of the box. Even puzzles that we did not account for at all, like the sequence puzzles mentioned by the reviewer, work without any changes to the code, as demonstrated above. Polymath is limited by the incompleteness of our mapping from Logic.py to CBMC or Z3, but this limitation is not fundamental or linked to any particular puzzle; it is simply a matter of exhaustively implementing the Python mapping. And of course, as the reviewer correctly points out and as we mentioned in previous comments, there are puzzles with size and complexity that realistically exceed the model's capability to formalise in Logic.py. However, we are convinced that the above examples for ZebraLogicBench, sequence puzzles, and Sudoku already make a compelling case for the flexibility of this approach.

  3. The suggestion in the prior comment was that UNSAT could be treated as a regular result like any other. Polymath would not restart in case of UNSAT, and Polymath's output to the user would just be "This puzzle has no solution." This would, of course, always be an incorrect answer in the case of ZebraLogicBench.

    In the case of ZebraLogicBench and FOLIO, since we knew that all tasks in these benchmarks must have a solution, our experiments hard-coded this knowledge and triggered a restart on UNSAT. For certain domains where Polymath might be deployed, e.g., a support case process flow system, this can be a valid and reasonable user choice.

    In general, this could be modified in various ways: 1) By default, always assume that "no answer" is a possibility for the puzzle and remove the restart on UNSAT entirely. As mentioned previously, this would have negatively affected our puzzle score by at most ~5%, and we can report these numbers in the experimental section as well. To increase confidence, we could report "no solution" only if multiple attempts in a row yield this result. 2) Future work could treat "restart on UNSAT / ambiguity" as an optional mode and let the model decide whether to activate this mode, e.g., based on the puzzle description.
Comment

OK, I admit that your answer to point 1 convinced me, and you can perhaps solve more kinds of problems with this approach. But the inability of your setup to handle problems that have no solutions without a serious loss of accuracy (or of the ability to deal with bogus LLM reformulations, which are also UNSAT) is in my view a serious limitation, and it looks inherent. I just can't seem to believe that LLMs can produce a genuine formulation, which we could trust, at their first go. I am convinced this limitation must be discussed in your paper if it gets accepted.

Comment

Thanks again for your challenging and detailed questions. We are convinced that your feedback, including the additional puzzle suggestions you provided, which we are adding to our repository, already offers valuable context to this research project.

Regarding the topic of unsatisfiability, we do not agree that puzzles without solutions introduce fundamental limitations in our approach. To demonstrate this, let us assume we remove the UNSAT edge in Fig. 5. As explained earlier, we only added it because we know beforehand that all puzzles in our two benchmark sets must have a solution.

Without this UNSAT edge, consider two trivial constraint satisfaction puzzles, one with a solution and one without:

A (solvable):
Find an even integer `x` between 8 (inclusive) and 10 (exclusive).
(Solution: x = 8)

B (unsolvable):
Find an even integer `x` between 9 (inclusive) and 10 (exclusive).

The possible outcomes in Polymath for these puzzles are:

For puzzle A:

A1) The LLM formalises the puzzle correctly (e.g. 8 <= x < 10 and x % 2 == 0), and we output 8.

A2) The LLM formalises the puzzle incorrectly, resulting in an incorrect solution. For example, the LLM might produce the Python code 8 < x <= 10 and x % 2 == 0, yielding the incorrect answer 10.

A3) The LLM formalises the puzzle incorrectly such that we report no solution. For example: 8 < x < 10 and x % 2 == 0

For puzzle B:

B1) The LLM formalises the puzzle correctly (e.g. 9 <= x < 10 and x % 2 == 0), and we correctly report that there is no solution.

B2) The LLM formalises the puzzle incorrectly, resulting in an incorrect solution. For example: 9 <= x <= 10 and x % 2 == 0, yielding the incorrect answer 10.

The purpose of these two examples is to demonstrate that problems without a solution are not fundamentally different for our approach. Our approach inherently involves uncertainty: the LLM might produce a valid but incorrect formalisation of the problem. This applies both to puzzle A, which has a solution, and puzzle B, which does not.
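
To make this concrete, a correct formalisation of puzzle A in the Logic.py style used elsewhere in this discussion might look roughly as follows (our illustrative sketch, assuming the Domain/validate conventions from the Sudoku example and Python range semantics; puzzle B would differ only in the range):

class Solution:
  x: Domain[int, range(8, 10)]   # candidate values 8 and 9 (puzzle B: range(9, 10))

def validate(solution: Solution) -> None:
  # The only remaining constraint: x must be even.
  assert solution.x % 2 == 0
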

We do not claim that our approach can guarantee a correct formalisation - irrespective of whether the puzzle has a solution or not. Our claim is that the formalisation process in Logic.py leads to better performance because, for many puzzles, formalising in Logic.py is an easier task for the LLM than solving the puzzle itself. We use the performance improvement of LLMs in ZebraLogicBench and FOLIO as evidence for this claim.

Based on our discussion of unsatisfiability so far, we suggest applying the following changes to our experimental section:

  1. Add the scores for ZebraLogicBench without the UNSAT edge. Our logs show this will lead to at most an absolute ~5% difference in score.
  2. Discuss that the difference between these two scores illustrates that if it is known beforehand that the puzzle must have a solution - as is the case for some real-world domains in which we operate - restarting on UNSAT will improve performance. We will also discuss that the same consideration applies for detecting ambiguity if the problems are known to have exactly one solution.
Comment
  1. To illustrate how Polymath would generalise to other problems, we manually ran its prompts against an exemplary sequence puzzle and Sudoku puzzle. These were in image form, and the model used in our benchmarks is not multi-modal, so we asked a multi-modal model to convert these two puzzles to text such that a text-only model could understand them.

Here is the result for the Sudoku puzzle:

This is a partially filled 9x9 Sudoku puzzle:

9 _ _ 2 _ _ 5 _ _
_ 8 7 _ _ 9 3 _ _
_ _ _ _ 3 6 1 _ 7
_ 4 _ _ _ _ 9 2 _
7 _ 6 4 9 _ 8 _ _
8 _ 1 5 6 2 4 7 3
4 _ _ 3 2 _ 6 _ 5
1 _ _ _ 8 _ _ _ _
_ 3 8 _ _ 7 _ 1 _

Fill out the blank `_` cells while adhering to the standard rules of Sudoku.

Running this text puzzle through the same Polymath prompt as used for Zebra Logic, we again first ask the model to define a data structure to hold the solution. The answer was:

class Cell:
  value: Domain[int, range(1, 10)]
class Row:
  cells: list[Cell, 9]
class Solution:
  rows: list[Row, 9]

Then Polymath asks the model to generate the constraints. In our running example, this was:

def validate(solution: Solution) -> None:
  # Enforce that each row contains unique values
  for i in range(9):
    cell1 = nondet(solution.rows[i].cells)
    cell2 = nondet(solution.rows[i].cells)
    assume(cell1 is not cell2)
    assert cell1.value != cell2.value
  # Enforce that each column contains unique values
  # ...
  # Enforce the given clues from the puzzle statement
  assert solution.rows[0].cells[0].value == 9
  assert solution.rows[0].cells[3].value == 2
  assert solution.rows[0].cells[6].value == 5
  # ...

The model used Python features in validate that we did not implement yet in our CBMC back-end, but we believe this demonstrates how such puzzles can be handled with Polymath.

Similarly, for the sequence puzzle:

This is an arithmetic sequence puzzle. Fill in the missing variables such that each line forms a sequence of numbers, where the arithmetic difference between two numbers in the same line is the same.

A, B, 21
A, 7, C, 15
D, C, E, 21

Data structure:

class Solution:
  A: Domain[int, range(1, 31)]
  B: Domain[int, range(1, 31)]
  C: Domain[int, range(1, 31)]
  D: Domain[int, range(1, 31)]
  E: Domain[int, range(1, 31)]

Constraints:

def validate(solution: Solution) -> None:
  # Line 1: A, B, 21 is an arithmetic sequence
  # The difference between B and A equals the difference between 21 and B
  assert (solution.B - solution.A) == (21 - solution.B)

  # Line 2: A, 7, C, 15 is an arithmetic sequence
  # The difference between 7 and A equals the difference between C and 7
  # The difference between C and 7 equals the difference between 15 and C
  assert (7 - solution.A) == (solution.C - 7)
  assert (solution.C - 7) == (15 - solution.C)

  # Line 3: D, C, E, 21 is an arithmetic sequence
  # The difference between C and D equals the difference between E and C
  # The difference between E and C equals the difference between 21 and E
  assert (solution.C - solution.D) == (solution.E - solution.C)
  assert (solution.E - solution.C) == (21 - solution.E)

The choice of domain (range(1, 31)) is questionable for the solution data structure, but we believe this demonstrates how such puzzles are well within the capabilities of our approach. In fact, any problem in NP can be mapped to a 3-SAT formula, and our approach can of course represent 3-SAT problems:

class Solution:
  x1: bool
  x2: bool
  # ...
def validate(solution: Solution) -> None:
  assert (solution.x1 or solution.x2 or solution.x3) and (solution.x4 ...

We would of course not expect the model to produce correct 3-SAT formalisations - at least not for puzzles which are not already in that form. After all, the entire design intent behind Logic.py is to give the model a more expressive and less error-prone API to describe problems in NP.

  2. Thanks for the feedback; we will add these details to the experimental section. We were also surprised by these results and how coding capabilities seem to translate to strong formalisation performance, given a safe DSL that leverages and reinforces these strengths.

  3. The observation that we cannot identify valid but incorrect formalisations is accurate. This is true irrespective of unsatisfiability. If the model creates a valid but incorrect formalisation, we will get an incorrect solution. This incorrect solution might be an incorrect assignment, or it might be an incorrect UNSAT response. We restart on UNSAT in our agent implementation because UNSAT is outside the codomain of valid responses for both our benchmarks. We could choose to make UNSAT always part of the codomain, or we could delegate that choice to the model, since the model decides the codomain anyway. We suggest something similar for ambiguity. Across all ZebraLogicBench puzzles and attempts, we only observed 51 instances of unsatisfiable formulas, so this would affect our puzzle score by at most 5.1%.

Review
Rating: 5

This paper presents Logic.py, a domain-specific language designed to power LLMs with constraint solvers. The proposed approach is a form of neuro-symbolic integration that is designed to target solving logic puzzles and search algorithm problems for which LLMs are known to be weak.

More concretely, the proposed approach first prompts LLMs to formalize problems in Logic.py --- a Python-based, human-readable DSL. This formalized representation is then solved using constraint solvers like CBMC or Z3.

This paper also presents a novel Polymath Agent Architecture that converts Logic.py code to C using libCST transformation and also includes error recovery mechanisms for syntax errors and unsatisfiable constraints.

Experimental results look strong, particularly on ZebraLogicBench, when comparing against the base LLMs' performance, e.g., Llama 3.1 70B improved from 24.9% to 90.7% accuracy, while GPT-4o improved from 31.7% to 77.8% and Claude 3.5 Sonnet improved from 36.2% to 76.9%. On the other hand, when compared against other SOTA models, the final results are not as strong.

Strengths and Weaknesses

Strengths:

  • The idea of symbolic system integration is neat and makes sense for logic problems that require constraint satisfaction.
  • The proposed language logic.py and the agentic system Polymath both seem interesting.
  • Strong empirical results over base LLMs of the same size and class.

Weaknesses:

  • The applicability of logic.py seems relatively narrow.
  • Not SOTA results compared to other prior work.

Questions

  • regarding the error recovery process, the paper states that "We currently do not provide any information about the syntax error to the model". It's not clear to me why this design choice was made.

  • The restart process isn't clearly defined: how many retries? What is the fallback?

  • This paper has a lot of typos, e.g., "constraint sovlers", "satisifes", "ZebraLogcBench", "translation315 based work", just to list a few.

Limitations

yes

Formatting Concerns

no concerns

Author Response
  • We share the reviewer’s opinion that providing generative error feedback to the model might further improve performance. We did not address this in this project, since we wanted to evaluate the fundamental impact of auto-formalisation on its own, and decided to leave generative feedback and other code repair strategies for future work.
  • If we encounter a recoverable error, we attempt to restart at most five times. We also explored fallback options like forwarding the original problem to the model in the case of a formalisation failure. However, in order to assess the impact of our formalisation, we did not apply any such fallback during our experimental evaluation.
  • Thank you for pointing out these typos, we already addressed them in our paper.
Review
Rating: 4

The paper presents Logic.py, a domain-specific language (DSL), and the Polymath agent to bridge LLMs with constraint solvers (e.g., CBMC, Z3) for logical reasoning tasks. Instead of direct LLM problem-solving, the approach prompts LLMs to formalize problems in Logic.py, which is then solved by solvers. Experiments on ZebraLogicBench and FOLIO show significant accuracy gains (e.g., Llama 3.1 70B improves from 24.9% to 91.4% on ZebraLogicBench).

Strengths and Weaknesses

Strengths:

  • The paper demonstrates significant performance gains (up to 65% absolute improvement) on two distinct benchmarks (ZebraLogicBench and FOLIO), using well-known models (Llama 3.1, GPT-4o, Claude 3.5). Results are compared against multiple baselines, including state-of-the-art methods, providing robust evidence of the approach’s efficacy.
  • The paper follows a clear narrative and is easy to read.
  • The approach addresses a critical limitation of LLMs (logical reasoning) with tangible improvements (e.g., Llama 3.1 accuracy from 24.9% to 91.4%), enabling more reliable automation of logic-intensive tasks (e.g., puzzle-solving, formal verification).
  • The "LLM+DSL+solver" paradigm introduces a new framework for combining neural and symbolic AI, which could inspire similar approaches in math reasoning, program synthesis, or planning.

Weakness:

  • While the integration is novel, the idea of combining LLMs with solvers is not entirely new (e.g., Logic-LM, SatLM). The originality lies in the DSL-mediated approach, but the core concept builds on existing neuro-symbolic research.
  • The paper hypothesizes that DSLs can bridge LLMs and solvers (analogous to compilation), but this hypothesis is tested only in logic tasks, leaving its validity in other domains unsubstantiated.
  • Experiments focus solely on logic puzzles and first-order logic, leaving broader applicability (e.g., numerical computational programming languages or algebraic modeling languages) untested. The "Threats to Validity" section acknowledges this, but empirical validation in other domains is missing.

Questions

  • The paper acknowledges Logic.py lacks support for complex Python features. Can you specify which features are missing (e.g., classes, generators) and how their absence limits problem formalization?
  • How does the pipeline's latency compare to direct LLM inference (e.g., time per puzzle for Polymath vs. Llama 3.1 alone)? Does the solver integration introduce bottlenecks for real-time applications?

Limitations

yes

Justification for Final Rating

My concerns about the paper have been addressed by the author's response. As my initial score was positive already, I just kept it for the final score.

Formatting Concerns

N/A

Author Response
  • At the Python language level, Polymath’s implementation is fairly complete and supports rich features like methods, classes, and inheritance, thanks to its being based on libCST. However, comprehensions and lambdas in particular are currently not implemented in our mapping to the back-end engines. Using these features will lead to spurious errors and restarts in Polymath, and we try to guide the model accordingly in the system prompt. (A small illustration of the comprehension limitation follows this response.)

    At a library level, we do not attempt to map native Python library code to our back-end engines, save for a few critical utility methods. Related work describes methods to derive such constraints from mixed-language environments, e.g., Gopan, D., Reps, T. (2007): Low-Level Library Analysis and Summarization; Nastaran Shafiei and Franck van Breugel (2013): Automatic Handling of Native Methods in Java PathFinder.

  • The Z3 and CBMC (SAT) engines themselves do not introduce measurable bottlenecks. The formulas produced for our two benchmark sets are orders of magnitude smaller than what these engines are optimised for. The formalisation by the model into Logic.py is actually more impactful here, since Logic.py is not optimised for brevity, but human readability. The average task formalisation (including restarts in case of failure) only takes 20-30s on our inference infrastructure, so this was not a concern in our experiments. However, real-time applications may want to hone their choice of DSL with this in mind.
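
Regarding the comprehension limitation mentioned in the first bullet above, here is a hypothetical illustration (ours, not taken from the paper; it assumes a solution.cells list and the validate/assert conventions shown earlier in this discussion). A constraint the model might naturally write with a comprehension, such as

assert all(c.value > 0 for c in solution.cells)   # comprehension: not yet mapped to the back-ends

would currently need to be expressed without comprehensions, for example as an explicit loop in the style of the Sudoku constraints shown earlier:

for c in solution.cells:
  assert c.value > 0   # same constraint, loop form
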

Final Decision

The reviewers believe the paper has value and presents an original approach for solving formal problems by formalizing them using an LLM and then solving them with a constraint solver. The authors defended their position well.