Can Transformers Reason Logically? A Study in SAT Solving

ICML 2025 · Poster · Submitted: 2025-01-23 · Updated: 2025-07-24
Overall rating: 4.9 / 10 · 4 reviewers · Scores: 2, 2, 2, 5 (min 2, max 5, std dev 1.3)
TL;DR

We theoretically show that Decoder-only Transformers can solve bounded 3-SAT instances with Chain-of-Thought


Keywords
Transformers, Logical Reasoning, SAT-Solving

Reviews and Discussion

Review
Rating: 2

This paper investigates whether transformers can solve 3-SAT problems by using COT reasoning to simulate backtracking-based search algorithms like DPLL. The authors theoretically demonstrate that this approach is feasible. Empirical evaluations show that: 1) language models using COT can be trained on reasoning paths to perform deductive reasoning effectively, and 2) the trained models struggle to generalize to problems with unseen sizes, such as different numbers of variables.
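For readers less familiar with the backtracking procedure referenced here, the following is a minimal DPLL sketch; it is illustrative only (not the authors' implementation) and uses DIMACS-style integer literals.

```python
# Minimal recursive DPLL sketch for CNF formulas (illustrative; not the
# authors' code). Clauses are lists of DIMACS-style integer literals:
# 3 means x3 and -3 means NOT x3.

def dpll(clauses, assignment=None):
    if assignment is None:
        assignment = {}

    # Simplify under the current partial assignment.
    remaining = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue                              # clause already satisfied
        open_lits = [l for l in clause if abs(l) not in assignment]
        if not open_lits:
            return None                           # conflict -> backtrack
        remaining.append(open_lits)

    if not remaining:
        return assignment                         # every clause satisfied -> SAT

    # Unit propagation: a single-literal clause forces its assignment.
    for clause in remaining:
        if len(clause) == 1:
            l = clause[0]
            return dpll(clauses, {**assignment, abs(l): l > 0})

    # Decision step ("assume"), then backtrack if the branch fails.
    var = abs(remaining[0][0])
    for value in (True, False):
        result = dpll(clauses, {**assignment, var: value})
        if result is not None:
            return result
    return None                                   # UNSAT under this branch

# (x1 OR x2) AND (NOT x1 OR x2) AND (NOT x2 OR x3)
print(dpll([[1, 2], [-1, 2], [-2, 3]]))           # {1: True, 2: True, 3: True}
```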

Update after Rebuttal

Thank you to the authors for the response and clarifications.

Despite the rebuttal, I still share the concerns of other reviewers regarding the lack of convincing evidence for the method's applicability in industry-scale settings. This limitation currently narrows the work's potential impact.

Consequently, I am maintaining my initial score.

Questions for the Authors

  1. For the results presented in Section 6.2, the authors argue that the failure to generalize to arbitrary lengths is consistent with their theoretical result, which states that the size of the transformer depends on the number of variables. However, this does not align well with the experimental design. The authors only experimented with varying problem sizes, rather than varying the sizes of the transformers themselves. Could the authors clarify this discrepancy?

  2. How does the efficiency of the transformer-based SAT solver compare to specialized SAT solvers in terms of computational resources and solving time? Or, how might the proposed method collaborate with existing SAT solvers to enhance SAT solving?

Claims and Evidence

The claims are generally well-described within the context and supported by both theoretical and empirical evidence. The authors provide a formal proof that transformers can solve 3-SAT via COT reasoning, and demonstrate its correctness empirically. Their experiments show strong in-distribution generalization but limited generalization to different input lengths.

However, the authors could elaborate on the relationship between their theoretical construction and how LLMs learn in real-world environments, as their construction requires specific weight configurations that may not naturally emerge during training procedures.

Methods and Evaluation Criteria

The experimental setup is suitable for the research questions. While the Turing completeness of chain-of-thought reasoning guarantees that a model can be constructed (likely with predefined parameters) to simulate DPLL, the more practical question is whether a model can learn to perform such operations from data (i.e., reasoning paths). The authors effectively designed and conducted experiments to investigate this question.

Theoretical Claims

While I have reviewed the proofs, I have not verified them in detail; the construction appears sound.

It's noteworthy that the authors managed to introduce parallelism into the standard backtracking procedure, which is a key difference between COT and traditional rule-based reasoners. The parallel processing capability inherent in LLMs offers potential advantages over sequential rule-based systems.

Experimental Design and Analysis

They are relevant to the research problem. The authors designed their evaluation datasets to avoid statistical shortcuts, enhancing the credibility of their results.

Supplementary Material

I skimmed through the appendix (i.e., pages 12-38) of the PDF. I appreciate the authors' effort in providing comprehensive details about their implementation.

Relation to Prior Literature

In my opinion, however, the ideas and findings have limited novelty, as they are, in some ways, a straightforward implication of existing literature. Since the Turing completeness of COT has been well discussed in previous research, it is not surprising for the community to see that COT can simulate a backtracking procedure like DPLL. From this perspective, I doubt whether this paper can contribute significantly to the research community.

In Section 2, the authors argue that their method requires a "drastically reduced number of COT tokens." This does not seem to be a fair comparison, as existing literature focuses on single-tape TMs, which are typically used as a basis for theoretical analysis rather than actual fine-grained complexity analysis. Theoretically, it is reasonable to expect that the numbers can be reduced when considering the properties of COT, the 3-SAT problem, and DPLL. However, this advantage is unlikely to extend to practical use cases, as using COT to simulate DPLL would introduce much overhead compared to standard rule-based reasoners.

The experimental design largely overlaps with other "learning to reason" experiments in the previous literature, and the results are generally intuitive. The authors fail to provide an in-depth analysis of how the learned model relates to their theoretical construction.

Missing Essential References

No.

Other Strengths and Weaknesses

None

Other Comments or Suggestions

In my opinion, the preliminary section could be rearranged, as readers at ICML are likely to be relatively familiar with transformers and can easily find additional materials if needed. Most of the content, particularly the formulas related to transformer blocks, is not utilized in subsequent sections. Instead, I recommend including a brief introduction to backtracking-based search algorithms, such as DPLL, which is used in the paper, to enhance readers' familiarity with the topic. The authors may also consider elaborating on the CoT aspect.

Ethics Review Concerns

N/A

Author Response

We thank the reviewer for their detailed feedback. We’re grateful for your acknowledgment of the soundness of our construction and the relevance of the experiments. We would like to focus on your primary concern regarding novelty compared to Turing Completeness (TC) results.

While we agree that TC of Transformers with CoT implies that, in principle, simulating algorithms like DPLL is possible, we believe this perspective overlooks the core contributions and significance of our work. Allow us to elaborate:

Beyond Turing Completeness: The "How" Matters

Our work, like previous works on Transformer expressiveness, aims at advancing our understanding of the capabilities and mechanisms of Transformer models in reasoning. Previous TC results rely on step-by-step simulations of single-tape Turing Machines (TM), which are theoretically foundational but do not explain well how logical reasoning is performed internally in Transformers; that is, many reasoning phenomena of LLMs remain poorly explained by TC alone.

Consider an analogy: Mathematical theorem proving (e.g., solving IMO problems) can be framed as search problems over LEAN proofs. If LLMs are TC, does this make their success on math benchmarks unsurprising or non-novel? We argue no. The interest lies in how LLMs solve these problems – potentially using more efficient, structured reasoning mechanisms than TM simulation. Our work moves beyond TC by asking how a Transformer can efficiently simulate a specific, non-trivial logical reasoning procedure (DPLL for 3-SAT) using its inherent architectural features. We provide an explicit construction and demonstrate a concrete CoT mechanism.

Novelty in Parallelism via Attention

As highlighted in your review, a key novelty is demonstrating how Transformers use parallelism effectively. We provide the first theoretical evidence that the attention mechanisms in Transformers can support parallelism for logical reasoning. This suggests that Transformer models may be particularly suitable for deductive logical reasoning over large contexts. In particular, Lemma 4.8 shows that a single Transformer head can perform satisfiability checking, conflict detection, and unit propagation over all clauses in parallel.
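One way to picture the clause-parallel operations described here (a rough sketch under assumed binary literal encodings, not the actual construction of Lemma 4.8): with clauses and the partial assignment encoded as 0/1 vectors over the 2p literals, satisfiability checking, conflict detection, and unit-clause detection over all clauses reduce to a few matrix-vector products and comparisons, which is the kind of computation a single attention head can batch.

```python
import numpy as np

# Clause-parallel checks over binary encodings (illustrative only; Lemma 4.8
# realizes analogous operations inside one attention head).
# Literal columns: column 2i is x_{i+1}, column 2i+1 is NOT x_{i+1}.

p = 4
clauses = [[1, 2, -3], [-1, 3, 4], [-2, -4, 1]]            # DIMACS-style literals
C = np.zeros((len(clauses), 2 * p), dtype=int)              # clause incidence matrix
for r, clause in enumerate(clauses):
    for lit in clause:
        C[r, 2 * (abs(lit) - 1) + (0 if lit > 0 else 1)] = 1

assignment = {1: True, 3: False}                             # current partial assignment
a_true = np.zeros(2 * p, dtype=int)                          # literals made true
a_false = np.zeros(2 * p, dtype=int)                         # literals made false
for var, val in assignment.items():
    a_true[2 * (var - 1) + (0 if val else 1)] = 1
    a_false[2 * (var - 1) + (1 if val else 0)] = 1

sat_count = C @ a_true                                       # satisfied literals per clause
false_count = C @ a_false                                    # falsified literals per clause
clause_len = C.sum(axis=1)

satisfied = sat_count > 0                                    # clause already satisfied
conflict = (~satisfied) & (false_count == clause_len)        # every literal falsified
unit = (~satisfied) & (false_count == clause_len - 1)        # one open literal remains

print("satisfied:", satisfied)   # [ True False  True]
print("conflict: ", conflict)    # [False False False]
print("unit:     ", unit)        # [False  True False] -> unit-propagate x4 = True
```

All three checks touch every clause simultaneously, which mirrors the parallelism claim rather than iterating clause by clause as a sequential rule-based solver would.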

Novelty in CoT Efficiency and Structure

While the reduction in CoT tokens compared to TM simulations might seem “reasonably expected”, anticipated theoretical results still significantly benefit from formal proofs—analogous to widely believed mathematical conjectures (e.g., P ≠ NP).

Our CoT simulates DPLL at an abstraction level higher than TM emulation, explicitly representing logical reasoning steps of assumption, deduction, and backtracking.

Furthermore, we formally provide an explicit upper bound (p·2^(p+1)) on CoT length (Theorem 4.5), a contribution extending beyond mere TC results. In particular, such upper bounds are unknown for TC-based constructions on 3-SAT.
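As a purely illustrative picture of such a trace, the sketch below runs DPLL and emits assumption/deduction/backtracking tokens in the spirit of the "D" (assume) and "[BT]" (backtrack) conventions described elsewhere in this discussion; the exact token names and formatting are assumptions, not the paper's tokenizer, and the assert only demonstrates the loose p·2^(p+1) bound on a toy instance.

```python
# Sketch of a DPLL run emitted as a CoT-style token trace (illustrative only;
# "D" marks an assumption and "[BT]" a backtrack; not the paper's exact format).

def dpll_trace(clauses, p):
    trace = []

    def literal_value(assign, lit):
        v = assign.get(abs(lit))
        return None if v is None else (v == (lit > 0))

    def solve(assign):
        # Unit propagation: deduce forced literals until a fixpoint or conflict.
        while True:
            forced = None
            for clause in clauses:
                vals = [literal_value(assign, l) for l in clause]
                if any(v is True for v in vals):
                    continue                      # clause already satisfied
                open_lits = [l for l, v in zip(clause, vals) if v is None]
                if not open_lits:
                    return False                  # conflict -> caller backtracks
                if len(open_lits) == 1 and forced is None:
                    forced = open_lits[0]
            if forced is None:
                break
            trace.append(str(forced))             # deduction step
            assign = {**assign, abs(forced): forced > 0}

        free = [v for v in range(1, p + 1) if v not in assign]
        if not free:
            return True                           # all variables assigned -> SAT
        var = free[0]
        for lit in (var, -var):
            if lit == var:
                trace.append("D")                 # assumption ("Decide")
            trace.append(str(lit))
            if solve({**assign, abs(lit): lit > 0}):
                return True
            trace.append("[BT]")                  # backtrack after failure
        return False

    trace.append("SAT" if solve({}) else "UNSAT")
    # Sanity check against the loose bound from Theorem 4.5 (holds on this toy run).
    assert len(trace) <= p * 2 ** (p + 1)
    return " ".join(trace)

# (x1 OR x2 OR x3) AND (NOT x1 OR x2 OR x3) AND (NOT x2 OR x3 OR x1) AND (NOT x3 OR x1 OR x2)
print(dpll_trace([[1, 2, 3], [-1, 2, 3], [-2, 3, 1], [-3, 1, 2]], p=3))
```

On this toy formula the sketch prints a short trace, "D 1 D 2 D 3 SAT", far below the worst-case bound.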

Preliminary Section Clarity

Regarding your suggestion on swapping the preliminaries section for background on 3-SAT and DPLL, this is indeed a very helpful suggestion for readability, and we will certainly do so during revision. In particular, we will swap the preliminary section on transformers with Appendix C.1 on 3-SAT after rewriting for brevity.

Questions

The main connection between the theoretical construction and the experimental results is the shared Chain-of-Thought design. Our theoretical construction provides a specific CoT structure. The experiments then investigate whether Transformers can learn this structure from data. The strong intra-length OOD generalization (Table 1) suggests that training on this theoretically grounded CoT does allow the model to learn a robust reasoning procedure applicable across different data distributions, overcoming limitations seen in prior work where models trained with CoT relied on statistical shortcuts [1].

We agree that additionally investigating the accuracy of models of different sizes trained on 3-SAT problems with Chain-of-Thought, in line with the O(p^2) scaling of our theorems, is an interesting and insightful direction of experimental investigation that aligns with the "scaling law" line of research. However, such experiments are beyond our theoretical focus and our available computational resources.

As you mentioned in the review, Transformer models would introduce significant overheads compared to modern SAT solvers. That said, it may be possible to develop efficient parallelized SAT-solver operations such as unit propagation, satisfiability checking, and conflict detection on GPU architectures using the insights from Lemmas 4.7 and 4.8. Our construction may also be used as a differentiable SAT-solving component that allows LLMs to formulate the underlying logic within natural language reasoning traces as SAT formulas and use traditional SAT solvers to perform more reliable reasoning.

[1] Zhang, et al. On the paradox of learning to reason from data. IJCAI '23

Review
Rating: 2

This paper studies the deductive logical reasoning capabilities of decoder-only Transformers. As claimed by the authors, many researchers reject the idea that LLMs are capable of reasoning, and there is little understanding of the fundamental limitations in the reasoning abilities of Transformer models. In this work, the authors claim that they prove by construction that decoder-only Transformers can solve 3-SAT in a non-uniform model of computation, and that the instantiated Transformer corresponding to the theoretical construction can perfectly solve 3-SAT instances. The authors conduct experiments to evaluate the model's performance and its ability to generalize to formulas with a different number of variables than seen during training.

Questions for the Authors

Actually, SAT is fascinating largely because of its usefulness in real-world applications. Hence, it would be more important to adopt industrial instances (e.g., instances from SAT Competitions) in your study rather than random instances; relying only on random instances significantly reduces the significance of this work. Besides random instances, how does the method perform on industrial instances (e.g., instances from SAT Competitions)?

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed method makes sense for the problem. However, the number of variables in the experimental instances is small.

Theoretical Claims

Yes.

Experimental Design and Analysis

The experimental design to evaluate the effectiveness and generalizability of the method is reasonable.

Supplementary Material

All appendices are reviewed.

Relation to Prior Literature

The paper discusses and proves the logical reasoning capabilities of decoder-only Transformers on the 3-SAT problem, which provides support for the view that "LLMs can reason".

Missing Essential References

N/A.

Other Strengths and Weaknesses

Strengths:

  1. The paper is generally well-written, with clear explanations.
  2. The code of the method is available.
  3. Experimental results are satisfactory.

Weaknesses:

  1. To verify the logical reasoning capabilities of decoder-only Transformers, it would be good to add research on other types of SAT problems.

Other Comments or Suggestions

N/A.

Author Response

Thank you for your devoted time in reviewing the paper. We're glad that you found the writing and experimental results of our work satisfactory.

To verify the logical reasoning capabilities of decoder-only Transformers, it would be good to add research on other types of SAT problems.

The suggestion is highly insightful! Including more complicated SAT instance structures beyond CNF formulas may reveal further insights into Transformer reasoning, especially when processing hierarchical data. This is currently beyond the scope of this work, and we will certainly look into these directions in future work. For the present work, since it is well known that all SAT formulas can be converted into 3-SAT in polynomial time, and ours is the first theoretical work investigating the capacity of Transformers in formal logical reasoning, we believe that our contributions are sufficient for the current manuscript.

Besides random instances, how does the method perform on industrial instances (e.g., instances from SAT Competitions)?

Regarding your question, we fully acknowledge the importance and impact of practical SAT solving, but we would like to clarify that our goal is to advance the theoretical understanding of the reasoning capability of Transformer models. Practical SAT solving is orthogonal to this goal. We chose 3-SAT as the theoretical basis for our study because it is a fundamental NP-complete problem in complexity theory that represents deductive logical reasoning. It is unlikely that our model will have any practical advantage over traditional SAT solvers. The DIMACS encodings and CoT of practical instances are significantly longer than the context lengths of our models. Therefore, we did not use any practical SAT benchmarks for evaluation.

We’re happy to answer any additional questions you may have regarding our work. We also sincerely hope that you can evaluate the theoretical aspects of our contributions in more detail if possible.

Review
Rating: 2

This paper investigates the deductive logical reasoning capabilities of decoder-only Transformers. The author(s) opt for the 3-SAT problem as a representative example of a logical reasoning task and use a Transformer model to achieve reasoning via Chain-of-Thought (CoT).

Questions for the Authors

  1. Please clarify weaknesses 1 and 2.
  2. The Transformer model also fails to achieve 100% accuracy on small instances (p ≤ 20), whereas traditional SAT solvers (e.g., DPLL, CDCL) solve such problems with perfect accuracy. Does this suggest that self-attention-based models are inherently limited in practical scenarios?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Theorem 4.5 and Theorem 1.1 cannot hold based on the experimental results, where the Transformer models fail to generalize to instances with more than 12 variables.

Experimental Design and Analysis

All experiments are conducted on very small-scale instances, where the number of variables in the 3-SAT problems is fewer than 20.

Supplementary Material

Yes, I review the appendix.

Relation to Prior Literature

This paper is the first work to theoretically analyze the ability of Transformer models to solve 3-SAT problems. It is related to the literature on AI/LLM reasoning.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths:

  1. The paper provides a thorough theoretical analysis of the ability of Transformer models to solve 3-SAT problems, offering valuable insights into their logical reasoning capabilities.
  2. The approach of tokenizing 3-SAT problems and employing a decoder-only Transformer to solve them is novel.

Weaknesses:

  1. Lack of formal justification for model hyperparameters (L=7, H=5). Theorem 4.5 asserts the existence of a decoder-only Transformer with L=7 layers and H=5 heads capable of solving 3-SAT. It seems that Appendix C.6 is the corresponding proof. However, Line 1358 reads more like a hypothesis about the function of the embedding layers. Is there any theoretical proof of this model configuration, or even an ablation study of each layer?

  2. Mismatch between theorem and empirical results. Theorem 4.5 claims universality ("for any p, c ∈ N+"), but the experiments only validate the construction on small instances (p ≤ 20 variables). The results show that Transformer models cannot be generalized to the instances with more than 12 variables.

Other Comments or Suggestions

The main body of this paper is well-organized, but the appendix is not clear. For example, while Appendix C.6 appears to provide a detailed analysis of Theorem 4.5, this link is not explicitly stated in the main text. The additional experiments in Appendix B are not clear.

Ethics Review Concerns

N.A.

Author Response

We greatly thank you for the detailed comments and feedback. We appreciate the reviewer’s careful reading and insightful concerns. We seek to clarify certain potential logical misunderstandings, hoping to address your identified concerns regarding the theoretical results. Most importantly, the correctness of Theorem 4.5 is justified by the proof in Appendix C, rather than the experimental results.

Theorem 4.5 and Theorem 1.1 cannot hold based on the experimental results, where the Transformer models fail to generalize to instances with more than 12 variables.

Mismatch between theorem and empirical results. Theorem 4.5 claims …

We fully appreciate the reviewer’s detailed review. However, we kindly clarify that Theorem 4.5, which shows the existence of a Transformer model that solves SAT, refers explicitly to the Transformer weight configuration rigorously defined in Appendix C. The empirical models, which showed limited generalization beyond 12 variables, were trained from data and thus had fundamentally different weight configurations. The configuration corresponding to Theorem 4.5 and described in Appendix C.6 is also implemented and corresponds to the "Compiled" results presented in Figure 3 and Section 5, achieving perfect accuracy on all tested instances. Again, we note that while such perfect accuracy is implied by Theorem 4.5, no amount of empirical tests can justify the correctness of Theorem 4.5. The experiment results for the compiled model only serve as a “sanity check” on smaller instances. Instead, the correctness depends on the Proof presented in Appendix C. Thus, the empirical limitations in generalization of the trained models and the number of variables in the tested formulas do not contradict our theoretical claims.

Lack of formal justification for model hyperparameters (L=7, H=5). It seems that Appendix C.6 is the corresponding proof…

We would like to clarify that these hyperparameters are indeed rigorously justified through the theoretical construction explicitly presented in Appendix C.6. Specifically, the configuration of the theoretical construction is provided between lines 1363 and 1469, detailing how each of the 7 layers explicitly contributes to achieving the theoretical result. The 5 heads stem from the fact that the operations in layer 5 require 5 attention heads to complete, and the number of attention heads must be the same across all layers by definition of the Transformer architecture. The operations are described at a high level, and each individual operation has a corresponding lemma in the preceding sections showing how it can be represented as attention or MLP layers. Also, we do not claim that 7 layers and 5 heads are minimal for SAT solving, but rather that they are an upper bound on the number of layers and heads required to solve 3-SAT.

Line 1358 mentioned in your review describes the “embedding layer” of the construction, which converts each token to input vectors before layer 1.

Regarding the possibility of ablation studies: while ablation studies are typically valuable in empirical research for determining the influence of each component of a machine learning model, they are not applicable to theoretical proofs.

while Appendix C.6 appears to provide a detailed analysis of Theorem 4.5, this link is not explicitly stated in the main text.

In lines 238-239 of the main text, we mentioned that the proof of Theorem 4.5 is in Appendix C, and Appendix C.6 is not only a “detailed analysis” of Theorem 4.5 but a part of its proof that describes the construction.

The additional experiments in Appendix B are not clear.

Thank you for pointing out the missing references to the figures from the main paper! For the experimental results in the appendix, Figure 5 corresponds to question 3, “How does error induced by soft attention affect reasoning accuracy?” on lines 322-323 (right), while Figure 6 adds the experimental results on the 2 remaining evaluation datasets, Random and Skewed, compared to Figure 3 in the main text. We will update the paper to clarify and add references to these additional experimental results.

Does it suggest that self-attention-based models are inherently limited in practical scenarios?

Our work focuses on advancing our theoretical understanding of the capabilities of Transformers in logical reasoning rather than on how Transformers can be used for practical SAT solving. Our results show that, given the number of variables and clauses, it is possible to construct a Transformer model that solves 3-SAT. These models indeed have efficiency limitations compared to traditional SAT solvers like CDCL and DPLL. That said, it might be possible to design efficient SAT-solving mechanisms by exploiting the parallelism introduced in Lemma 4.8 and GPU architectures.

Reviewer Comment

Thanks for the authors' response. Most of my concerns have been addressed properly. However, the current empirical results are far from practical use. Modern SAT solvers can easily handle instances with thousands of variables, while this study focuses on instances with fewer than 12 variables. The scalability issue needs to be addressed, or at least be potentially addressable. Therefore, I'll keep my rating.

Author Comment

Thanks for your reply. We're very glad to hear that our comments helped address your concerns.

Regarding your new point on the scalability of our method, we would like to kindly clarify that our work is a theoretical work investigating the capabilities of the Transformer architecture. In particular, our main contribution concerns the mechanisms that allow the Transformer to perform (parallelized) logical reasoning and an efficient Chain-of-Thought for 3-SAT, based on backtracking and deduction, that Transformers can provably simulate. The experiments are supplementary evidence that investigates how the Chain-of-Thought of our theoretical construction allows for effective learning. Practical SAT solving, while important, is orthogonal to our contribution.

Multiple previous theoretical works have also included supplementary experiments on how well Transformers can perform multi-digit addition [1], evaluate arithmetic expressions [2], solve linear equations [2], perform parity checking [3], and simulate semiautomata [4]. All of these procedures can be simulated much more efficiently by a regular computer program. Similarly, our work uses 3-SAT as a theoretical model for logical reasoning, and we do not claim that our method can compete with traditional SAT solvers in terms of efficiency. Therefore, we respectfully disagree that "the current empirical results are far from practical use" constitutes a significant limitation/weakness of our work. While we're glad to discuss the potential implications of our theoretical result for practical SAT solving, we hope that, in evaluating our paper, you can focus more on our theoretical contributions than on the practicality of our experiments.

Thanks so much for your consideration.

[1] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
[2] Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
[3] What Algorithms can Transformers Learn? A Study in Length Generalization
[4] Transformers Learn Shortcuts to Automata

Review
Rating: 5

The authors present a theoretical foundation and proof that it is always possible to construct a decoder-only transformer capable of exactly simulating the DPLL search and solving any 3-SAT task with greedy decoding that uses CoT reasoning steps and backtracking. The authors show that given any 3-SAT input with p variables, it is possible to create a set of modules (all of which are separate parts/layers and together comprise a complete transformer), each simulating a portion of the DPLL search process. This reveals that transformers are capable of solving any 3-SAT task (an NP-complete problem). The paper also shows that for p variables and c clauses, there exists a transformer that can solve the 3-SAT problem with no more than p·2^(p+1) CoT iterations for a model with L = 7 layers, 5 heads, and O(p^2) parameters. Additionally, the authors show that even pretrained transformers can be trained to solve 3-SAT tasks where the number of variables is close to the number of variables seen during training, although out-of-distribution generalization w.r.t. the number of variables in a 3-SAT problem is still non-trivial for transformers.

Questions for the Authors

  1. The compiled (constructed) model achieves perfect accuracy on 3-SAT instances. Given that the worst-case CoT length is theoretically exponential (since 3-SAT is NP-complete), can the authors provide insights into the characteristics of SAT instances where the model’s CoT length approaches this worst-case behaviour, and how often such instances can occur realistically?

  2. Given that the number of chain-of-thought tokens is massively lower than the theoretical bound, can it be assumed that having more compute at test time would allow the models better generalization w.r.t. the number of variables? What happens as the CoT gets longer? Do the models collapse and start repeating explored patterns? Do they hallucinate?

Post rebuttal comments

I feel all of my concerns have been addressed after the rebuttal. Increasing the score.

Claims and Evidence

All of the claims are valid and theoretically proven or empirically evaluated. The theoretical background given for the paper is rather extensive and covers a broad spectrum of non-trivial decoder-only transformer decompositions into modules that can be viewed as deterministic components of DPLL (PARAT).

Methods and Evaluation Criteria

The only empirical components of the paper involve training a pre-trained decoder-only Transformer to solve the 3-SAT task. The evaluation is straightforward, looking at the explicit accuracy of the SAT/UNSAT labels produced after CoT and backtracking.

Theoretical Claims

The theoretical claims of the paper are rather extensive, maintaining that given the number of variables p, it is always possible to construct a decoder-only transformer that can exactly solve any 3-SAT problem akin to DPLL while using CoT reasoning and backtracking. All of the claims are adequately supported, although it must be mentioned that without reading the complete appendix it is impossible to understand the main idea and components of the paper. My only problem with the paper is the way that the idea is presented in the main part of the paper.

Experimental Design and Analysis

The experimental design is straightforward and tries to validate whether arbitrary pre-trained decoder-only transformers can learn to solve 3-SAT problems. The accuracy for SAT/UNSAT final outputs shows that learning the task is feasible, but generalization outside of the number of variables present in the training set is non-trivial and complicated.

Supplementary Material

I had to walk through most of the appendix to properly understand the bulk of the method, as the majority of theoretical explanations in the main part of the paper lack rigour and flow (the main paper is extremely hard to follow on its own). I would argue that the whole construction (PARAT) section can only be understood from the appendix.

Relation to Prior Literature

The paper explores a fundamental transformer capability, revealing that transformers are able to solve the 3-SAT task, which falls in line with research exploring the theoretical limitations and expressiveness of the transformer architecture.

Missing Essential References

N/A

Other Strengths and Weaknesses

The main weakness is the way that the method is explained in Section 4. Most of the explanations and flow seem handwavy and not rigorous. However, their counterparts in the Appendix more than cover the need for rigour. I would still maintain that restructuring the explanation in favour of stricter but more understandable transitions would make it easier for the reader to follow.

Other Comments or Suggestions

I think the solution for the example on page 25, line 1355 in the appendix for the proposed 3-SAT example is wrong.

Author Response

We’re very grateful for your devoted time and careful review of both the main paper and the appendix. We’re also very glad that you liked the contributions of our paper and consider the proof and description in the appendix rigorous and helpful. We would like to address your comments regarding the difficulty of understanding the proof sketch, and we outline possible improvements to its clarity below:

Clarity Improvements to the Proof Sketch

Swapping the Preliminary Section on Transformers with Appendix C.1 on 3-SAT (also suggested by reviewer sNWz): From your review as well as suggestions from reviewer sNWz, we believe that the most unclear part of the main paper is likely the material on 3-SAT operations and DPLL. As such, we will swap out the detailed mathematical definitions of the Transformer architecture in Section 2 for the descriptions of 3-SAT in Appendix C.1.

Organize the Proof Sketch around Major Steps in the Construction: For Section 4, we will organize the sketch around the major steps of the Transformer construction, i.e., into the following paragraphs:

Step 1: Summarize Clauses and Assignments as Binary Vectors: where we introduce definition 4.6 and explain how the binary encodings can be computed by summing up one-hot encodings in each clause using the attention mechanism of the Transformer

Step 2: Perform Parallel Logical Operations over Clauses: where we introduce Lemmas 4.7 and 4.8 and explain how to determine Satisfiability, Conflict, and Unit Propagation

Step 3: Next Token Prediction: where we describe how the next token is decided based on the results calculated in the previous layers
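To make Step 1 concrete, here is a small numerical sketch (the dimensions, names, and the uniform-attention shortcut are assumptions for exposition, not the paper's exact construction): summing the one-hot encodings of a clause's literal tokens, which uniform attention over those tokens computes up to a scaling factor, yields the binary clause vector on which the parallel checks of Lemma 4.8 then operate.

```python
import numpy as np

# Step 1 sketch: literal tokens as one-hot vectors in R^{2p}, aggregated into a
# binary clause encoding by a uniform attention pattern (illustrative only).

p = 3                                              # number of variables

def one_hot(literal):                              # literal in {+/-1 .. +/-p}
    v = np.zeros(2 * p)
    v[2 * (abs(literal) - 1) + (0 if literal > 0 else 1)] = 1.0
    return v

clause = [1, -2, 3]                                # (x1 OR NOT x2 OR x3)
token_embeddings = np.stack([one_hot(l) for l in clause])     # shape (3, 2p)

# A head attending uniformly over the clause's literal tokens outputs their
# average; rescaling by the clause length recovers the binary clause encoding.
attn_weights = np.full(len(clause), 1.0 / len(clause))
clause_encoding = len(clause) * (attn_weights @ token_embeddings)

print(clause_encoding)    # [1. 0. 0. 1. 1. 0.] -> marks x1, NOT x2, x3
```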

Example on line 1355

Thank you for your careful reading of the appendix. After careful checking, the example is indeed wrong. The correct solution should be:

[SEP] D 2 D 1 -4 3 [BT] D 2 -1 -4 [BT] -2 D 3 D 4 1 SAT

We will also update this in the paper.

Questions

Can the authors provide insights into the characteristics of SAT instances where the model’s CoT length approaches this worst-case behaviour and how often such instances can occur realistically?

We expect all instances to require fewer steps than this bound, but we do not know of tighter upper bounds that are strictly proven. The number of CoT steps required depends on many factors, including the heuristic used to select the next decision (assumption) at each iteration. Providing tighter theoretical upper bounds for solving different types of SAT formulas is an active direction of research in the SAT solving community (e.g., see [1], Section 1.5). We instead used a large upper bound that is guaranteed to hold for all instances and heuristics.

Can it be assumed that having more compute at test time would allow the models better generalization w.r.t. the number of variables?

We do believe this to be true according to relevant literature on practical LLMs, where longer CoT leads to better reasoning performance. However, it’s difficult to empirically test this hypothesis in our setting since there isn’t a reliable way to increase test-time computation for custom-trained 3-SAT models. In particular, our models are deterministic and do not support additional prompts to elicit longer reasoning chains. Your suggestion is indeed an insightful and promising direction for future work.

What happens as the CoT gets longer? Do the models collapse and start repeating explored patterns? Do they hallucinate?

Yes. We can investigate this phenomenon by observing how models trained on large (11-15) instances behave on smaller instances. For example, on a 3-SAT problem with 5 variables, the model outputs: (recall that D represents “Assume” and [BT] represents “BackTrack” according to Appendix C.1)

D -4 D -1 -5 2 3 D [BT] D -4 D -1 -5 2 3 D -4 D [BT] D -4 D -1 -5 2 3 D -4 D -5 SAT

The model can be considered to be hallucinating since the final D at the end of the first attempt should never occur. Similarly, at the final assignment before SAT, both -4 (x_4 = F) and -5 (x_5 = F) occurred 2 times, which shows that the model starts repeating explored patterns. Both of these phenomena are quite common when the CoT is longer than normal.

In another test sample of 5 variables, the model outputs:

D -4 D -3 1 2 -5 -4 -18 17 -26 SAT

This output hallucinates variables that do not exist in the formula (and also outputs -4 twice).

What’s more interesting is that the models still have high SAT vs UNSAT accuracy even when these hallucinations/repetitions occur, and our test data explicitly removes statistical evidence on SAT vs UNSAT. This can be seen from Figure 3 (right), where the SAT vs UNSAT accuracy remains high even when the Full Trace Correct accuracy falls off significantly. It seems that the trained model still has ways to “guess” the correct SAT vs UNSAT answer even with a CoT that is not fully correct and longer than normal.

[1] On The Unreasonable Effectiveness of SAT Solvers

Final Decision

A subset of reviewers and the AC agree that the paper makes interesting and novel contributions to the theoretical study of Transformer-based reasoning for 3-SAT. Two reviewers request experimentation on larger-scale problems, but I agree with the authors that such requests are orthogonal to their key claims and contributions, which are focused on theoretical understanding and validating its reduction to practice.

While the techniques and results are perhaps not surprising, the results themselves are novel, and the quality of the discussion and presentation, coupled with some critical insights (e.g., soft vs. hard attention differences), makes this paper a solid and timely contribution on the topic of logical reasoning and Transformers.

During revision, the authors should make an effort to incorporate the clarifications and helpful discussion from the rebuttal. I also include below some critical feedback that the authors should consider as they prepare their final version:

  1. The paper has two divided parts: first, a hand-crafted compiled model that does perfect inference on tiny SAT instances, utilizing CoT as a “tape” of arbitrary memory size (which grows as O(p \times 2^{p+1}), and which also roughly corresponds to inference time). In the end, it’s basically a manually written matrix-program construction running in exponential time, which does not seem to be a surprising result for 3-SAT, although bounding the number of layers, heads, and parameters is a useful technical contribution. The authors' current discussion of contributions focuses more on what they technically achieve and less on what parts are surprising and likely to impact downstream use of these results.
  2. The authors claim that the compiled model is a standard Transformer, but it uses hard attention. In the appendix, they actually show that softer attention doesn’t really work, so their other claim that their study reveals “LLM + CoT can learn SAT” requires some qualification in my view.
  3. The authors' second approach to train a model (soft attn Llama) obviously yields a different algorithm than their compiled version.
  4. Overall, the theoretical approach (compiled model) isn't new, but the domain-specific insights and bounds for 3-SAT are interesting, the attention discussion is important and useful, and hence the paper should generally be informative for the community interested in SAT-oriented logical reasoning with Transformers.