PaperHub
4.3 / 10 · Rejected · 4 reviewers
Ratings: 3, 3, 5, 6 (min 3, max 6, std 1.3)
Confidence: 3.8 · Correctness: 2.3 · Contribution: 2.0 · Presentation: 2.8
ICLR 2025

LINA: An LLM-driven Neuro-Symbolic Approach for Faithful Logical Reasoning

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
Large Language Models · Logical Reasoning · Neuro-Symbolic Approach · Hypothetical-Deductive Reasoning

Reviews and Discussion

Official Review
Rating: 3

This paper proposes a purely prompt-based framework, named LINA, that solves reasoning problems. The framework first prompts the LLM to convert the problem into a formal logic representation accompanied by natural language information; it then solves the problem as a deductive reasoning task by iteratively prompting the reasoner to deduce new facts and the supervisor to verify them.

Strengths

See below

Weaknesses

Novelty

The proposed method is a purely prompt-based framework with a straightforward design. The specific design of performing reasoning without using external tools has been studied in several prior works [1,2]. As such, the novelty of this work is minor.

Quality

The idea of "removing the tool usage yields better performance for deductive reasoning" is poorly motivated and justified.

L48 "First, the process of converting logical problems into formal expressions leads to information loss"

  • This is true for problems without FOL ground truth, such as ReClor and LogiQA, which are evaluated in the experiments.
  • However, these problems are not meant to be solved with traditional formal logic methods in the first place; prior work such as SatLM and Logic-LM mostly focuses on solving the NLI task with datasets that come with ground-truth FOL annotations. Also note that ReClor and LogiQA contain not only deductive reasoning but also other reasoning tasks that cannot be characterized by FOL.
  • That said, criticizing that translation leads to information loss is fair, but it hardly motivates the approach proposed here if it is meant to solve problems that already fall outside of the formal logic bucket.

L78 "Second, the reliance on specific external tools results in poor generalization of these methods, limiting them to solving only certain types of problems, such as FOL propositional inference problems or satisfiability problems"

  • This statement is problematic. Many works show that tool usage increases rather than decreases the capability of LLMs in solving formal reasoning problems.
  • Formal tools such as Prover9 and Z3 can be used not only for propositional logic but also for first-order logic. Moreover, SAT is a very generic problem setting into which many reasoning problems can be converted (see the sketch after this list), and being able to solve SAT problems should not be considered a disadvantage.
  • As such, the authors should motivate their work properly.
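To make this point concrete, here is a minimal z3py sketch (an illustration added here, not taken from the paper or the cited works) showing how a deductive entailment query reduces to a satisfiability check:

```python
# Minimal z3py illustration: a deductive entailment query reduces to a
# satisfiability check over the premises.
from z3 import Bools, Solver, Implies, Not, unsat

# Toy premises: "If it rains, the ground is wet. It rains."
rain, wet = Bools("rain wet")

s = Solver()
s.add(Implies(rain, wet))  # premise 1
s.add(rain)                # premise 2
s.add(Not(wet))            # negation of the candidate conclusion "the ground is wet"

# Premises plus negated conclusion unsatisfiable => conclusion is entailed.
print("entailed" if s.check() == unsat else "not entailed")
```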

Not every reasoning problem in ReClor and LogiQA can be cast as deductive reasoning:

  • The authors propose to solve all reasoning problems with deductive reasoning. This is simply inappropriate for many of the problems in ReClor and LogiQA. For example, ReClor contains questions like "which of the following most challenges/supports/aligns with the argument in the context?" and "which of the following arguments shares the same reasoning pattern as that in the context"; such questions do not fit into any formal logic category and certainly cannot be solved with deductive reasoning.

The experiment setting misses many details and is potentially problematic:

  • It's unclear how many ICL examples are used for the GPT CoT baselines. However, an accuracy of 76% with GPT-4o on ReClor seems too bad to be true. As a comparison, [3] shows that with just a few ICL examples, GPT-3.5 can achieve about 60% accuracy and GPT-4 can achieve above 90% accuracy, which aligns much better with the scores reported on the public leaderboard.
  • As mentioned above, including methods like LINC in the ReClor and LogiQA benchmarks is not sensible, as these methods are designed for the NLI task and not for these benchmarks.

Clarity

The paper is generally easy to follow.

Significance

While I agree with the authors that moving beyond standard NLI tasks into more "in the wild" reasoning problems such as those in ReClor is an interesting and important direction, this cannot justify the purely prompt-based design, as it effectively renders the approach yet another fancy CoT method that could hallucinate during its reasoning. From a pure performance perspective, the significance of this work is still questionable, as the results from the baseline approaches are too bad to be true. As such, the significance is also minor.

[1] Zhu, Zhaocheng, et al. "Large language models can learn rules." arXiv preprint arXiv:2310.07064 (2023).

[2] Feng, Jiazhan, et al. "Language models can be logical solvers." arXiv preprint arXiv:2311.06158 (2023).

[3] Yang, Yuan, et al. "Can LLMs Reason in the Wild with Programs?." arXiv preprint arXiv:2406.13764 (2024).

Questions

see above

Comment

We thank you very much for your detailed and constructive review. We hope the following response could address your concerns.

Q1: Could the authors explain the novelty of this paper compared to [1] and [2]?

  • [1]: This work employs fine-tuning to enable reasoning in natural language and aims to make LLMs function like symbolic solvers. In contrast, LINA’s Information Extraction module retains both FOL expressions and natural language information, offering the advantages of symbolic methods without requiring fine-tuning.
  • [2]: This approach teaches LLMs reasoning rules before reasoning. LINA focuses on a novel information extraction method, leveraging symbolic information to assist LLM reasoning. Additionally, the introduction of the hypothetical-deductive method enhances generalizability and performance, marking a key contribution of our work.

Q2: Could the authors explain their research motivation?
We acknowledge the significant contributions of works like LINC and SatLM, which integrate external solvers into symbolic methods. However, these works primarily utilize LLMs for language understanding, converting logical reasoning tasks into formats comprehensible to specialized solvers.

LINA aims to exploit LLMs' strong generalization and reasoning capabilities, addressing limitations where solvers are task-specific. We propose innovative information extraction and reasoning methods, reinforcing LLMs’ reasoning ability and improving performance.

Q3: Why did the authors choose deductive reasoning to address problems in ReClor and LogiQA that don't belong to formal logical categories?
Our approach goes beyond simple deductive reasoning by employing the hypothetical-deductive method, a general-purpose reasoning strategy. In closed-choice questions, we treat each Option as a hypothesis, integrating it with Question context for evaluation. For example, when addressing "Which option most challenges the argument in the Context?", we extract each Option’s perspective, assume it to be the most challenging viewpoint, and use the Reasoning module to test its validity.

This flexible approach is effective even when formal logical categories are not strictly applicable.
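A minimal sketch of how such an option-as-hypothesis loop could look, assuming a generic `llm` completion function; the prompts, control flow, and tie-breaking step are illustrative assumptions, not the authors' exact implementation:

```python
def evaluate_options(context: str, question: str, options: list[str], llm) -> str:
    """Treat each option as a hypothesis and test whether the context supports it.

    `llm` is a placeholder text-completion call; prompts are illustrative.
    """
    supported = []
    for opt in options:
        # Restate the question + option as a declarative hypothesis.
        hypothesis = llm(
            f"Question: {question}\nOption: {opt}\n"
            "Restate this option, in the context of the question, as a single declarative hypothesis."
        )
        # Deduce from the context whether the hypothesis holds.
        verdict = llm(
            f"Context: {context}\nHypothesis: {hypothesis}\n"
            "Reason step by step and answer 'supported' or 'not supported'."
        )
        if verdict.strip().lower().startswith("supported"):
            supported.append(opt)
    if len(supported) == 1:
        return supported[0]
    # If several (or no) hypotheses survive, adjudicate in one more round.
    pool = supported or options
    return llm(f"Context: {context}\nQuestion: {question}\nChoose the single best answer from: {pool}")
```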

Q4: Could the authors explain the complexity of their method, given that such designs might lead to hallucinations?
Our modular reasoning process is explicitly designed to reduce hallucinations, not increase them.

  1. Deductive Reasoner: Employs step-by-step reasoning combined with a Supervisor to verify FOL-based reasoning rules, mitigating biases seen in multi-step CoT reasoning.
  2. Reasoning Process Judgment: Provides an additional layer of verification, refining outputs to ensure only the most reliable conclusions are presented.

Thus, the complexity reflects our effort to minimize hallucinations and enhance reasoning reliability.
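A rough sketch of the reasoner/supervisor interaction as described above (an outside reading of this response, not Algorithm 1 from the paper; `llm`, the prompts, and the reset behavior are assumptions):

```python
def deduce_with_supervision(fol_facts, nl_facts, hypothesis, llm, max_steps=8):
    """One-step deduction with supervisor verification (illustrative sketch only)."""
    conclusion = hypothesis
    for _ in range(max_steps):
        # Deductive Reasoner: propose exactly one new deductive step.
        step = llm(
            f"FOL facts: {fol_facts}\nNL facts: {nl_facts}\nCurrent conclusion: {conclusion}\n"
            "Apply exactly one FOL inference rule and state the new conclusion."
        )
        # Supervisor: check the step against FOL reasoning rules.
        verdict = llm(
            f"Facts: {fol_facts}; {nl_facts}\nProposed step: {step}\n"
            "Is this a valid application of an inference rule? Answer 'valid' or 'invalid'."
        )
        if "invalid" in verdict.lower():
            conclusion = hypothesis  # reset to the hypothesis, as described in the rebuttal
            continue
        conclusion = step
        done = llm(
            f"Conclusion: {conclusion}\n"
            f"Does this conclusion settle the hypothesis '{hypothesis}'? Answer 'yes' or 'no'."
        )
        if done.strip().lower().startswith("yes"):
            break
    return conclusion
```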

If any concerns remain unresolved, please feel free to let us know. Thank you again for your time and patience.

Comment

Thanks for the response. Unfortunately, I think this concept of hypothetical-deductive method is not a well-defined concept, nor is it a well-justified approach to these problems. What's the benefit of it? Why is it a valid method for the ReClor-type problems? This is implicitly assumed in the paper and is only brought up in the rebuttal. To better motivate this, one needs to rewrite the paper to center around this concept and carefully design the experiments and ablations for justification, and this requires an overhaul of the draft and another round of reviews to make sure things are in place. That said, I keep my score.

Official Review
Rating: 3

The paper proposes a framework called LINA to address the generalization problem and information loss found in existing methods. The framework consists of two main components: an Information Extraction Module and a Symbolic Reasoning Module. First, the Information Extraction Module condenses and translates the reasoning question into a symbolic format. Then, the Symbolic Reasoning Module iteratively performs one-step deductive reasoning, utilizing both symbolic and natural language, with a judgment step to verify the correctness of each reasoning step. By leveraging GPT-3.5 and GPT-4o, the paper demonstrates that LINA outperforms the baselines across five datasets. Additionally, the paper includes comparisons to ToT and SatLM, along with an ablation study and a case study.

Strengths

(1) The framework claims to effectively address the issue of poor generalization to different question formats and the problem of information loss when using only symbolic language by combining symbolic and natural language.

(2) The main experiment shows that the method surpasses the baselines across five datasets using GPT-3.5 and GPT-4o.

Weaknesses

(1) The level of innovation in this work raises some concerns. To the best of my knowledge, some previous work (SymbCoT [3]), also addresses the issue of information loss by leveraging both natural language information and first-order logic (FOL). The key difference is that SymbCoT conducts reasoning and verification as a linear process, whereas this work transforms the process into an iterative one. In summary, it appears that this work mainly modifies the linear process from SymbCoT into an iterative framework, which limits its novelty. From my perspective, the primary innovation lies in this framework’s adaptability to a wide range of question formats, a feature the previous work lacks.

(2) The Information Extraction Module requires further clarification. How is the context classified? Additionally, how do you determine the "ease of translation"? Upon reviewing the context classification prompt provided in the appendix, it seems more focused on simplifying logical statements rather than classification. Please clarify if my understanding is incorrect.

(3) In Section 4.2, you explain that the context is first classified into lengthy text and non-lengthy text, with the lengthy text then being condensed into shorter sentences. These condensed texts are further classified based on their ease of translation. Further details are needed to understand this process. For example, how many classes are used in this step? Which classes will be translated, and which will not? This is important because the paper claims an advantage in using both symbolic and natural language, so it is crucial to understand what content is represented in symbolic language and what remains in natural language.

(4) The Reasoning Module lacks crucial details. Firstly, there is no explanation of how the deductive process works and how information LS, NL, and H interact to reach the reasoning conclusion C. Secondly, when performing the Check() operation, is it checking for errors in the reasoning process, or is it verifying whether the information contradicts or supports the hypothesis? Third, you mention that if an error occurs, the supervisor may adjust C or reset C = H. How is this step implemented exactly? This is not explained in the main text nor in Algorithm 1, and more details are needed to help readers understand how the reasoning module operates.

(5) The paper lacks a detailed analysis, which hinders the reader's understanding and the transparency of the framework. For example, the paper's main claim is that it addresses information loss and improves the framework's generalizability, but there is a lack of relevant analysis to support this claim. Besides, prior work in this stream (e.g., LINC [1], Logic-LM [2], SymbCoT [3]) typically includes an analysis of accuracy per necessary proof depth in ProofWriter. Including this type of analysis would be valuable, as it could demonstrate how robust your method is with respect to increasing reasoning complexity, a common challenge in real-world applications. Furthermore, the paper lacks an error analysis, which would provide a clearer understanding of where failures occur and improve confidence in the proposed framework.

(6) The analysis section also lacks some details. In Section 5.4, when you state that the LLM cannot generate effective z3 solver code or easily adapt for execution, does this mean that the rule-based solver completely fails to execute the problem, or can it execute but fail to reach the correct answer? Do you have quantitative data, such as execution rates, to back up this observation?

References:
[1] LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers (Olausson et al., EMNLP 2023)
[2] Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning (Pan et al., EMNLP 2023)
[3] Faithful Logical Reasoning via Symbolic Chain-of-Thought (Xu et al., ACL 2024)

Questions

(1) Why did you choose to use the train set of FOLIO while using the validation set for other datasets? Is there a specific reason for this decision? Most prior work (e.g., Logic-LM) typically evaluates the test set of FOLIO, so it would be helpful to clarify the rationale behind this choice.

(2) Could you provide more details about how the hypothesis is generated? Additionally, could you elaborate on how the Reasoning Process Judgment is integrated into the framework? It appears in Figure 1 but is not included in Algorithm 1, which causes some confusion. Providing more information on this would make the methodology easier to follow for readers.

(3) Do you have quantitative data to support your claim?

Comment

We thank you very much for your detailed and constructive review. We hope the following response could address your concerns.

Q1: How are the contribution and novelty of this work?
Our method differs from SymbCoT in its information retention mechanism. While SymbCoT iteratively conducts symbolic reasoning, our approach emphasizes the hypothetical-deductive process.

Our Information Extraction module preserves both first-order logic (FOL) expressions and critical natural language information, allowing our method to utilize symbolic information effectively rather than merely treating the LLM as a symbolic reasoner.

Additionally, our reasoning framework is not a simple "Plan-and-Solve" structure. Recognizing the limitations of purely symbolic reasoning with LLMs, we have designed a framework based on the scientific method of hypothesis-deduction, enhancing the reliability of LLM reasoning. Therefore, our method is not a simple iterative symbolic reasoning approach.

Q2: Is the Information Extraction module more focused on simplifying logical statements rather than classification? Explain the module in detail.
The Information Extraction module operates in three steps:

  1. Simplifying all text to enhance clarity.
  2. Classifying text based on its ease of transformation into FOL expressions.
  3. Transforming the easily convertible parts into FOL expressions while retaining critical natural language information.

This approach ensures effective extraction of logical and textual information.
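A hypothetical worked example of these three steps (the sentences, classification, and FOL strings below are illustrative, not drawn from the paper's prompts or datasets):

```python
# Hypothetical illustration of the three-step Information Extraction process.
context = ("All employees who joined before 2020 received stock options, "
           "and Maria joined in 2018, although her offer letter was delayed.")

# Step 1: simplify the text into short, clear statements.
simplified = [
    "All employees who joined before 2020 received stock options.",
    "Maria joined in 2018.",
    "Maria's offer letter was delayed.",
]

# Step 2: classify each statement by how easily it translates into FOL.
easy_to_translate = simplified[:2]
kept_as_natural_language = simplified[2:]  # nuance a FOL rendering would lose

# Step 3: translate the easy part, retain the rest verbatim.
fol_expressions = [
    "∀x (JoinedBefore2020(x) → ReceivedStockOptions(x))",
    "JoinedBefore2020(Maria)",
]
extracted = {"FOL": fol_expressions, "NL": kept_as_natural_language}
```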

Q3: Why does FOLIO use the training set instead of the validation set?
The FOLIO training set contains 1000 samples, while the validation set only has 203. The larger size of the training set aligns better with the scale of other datasets, allowing for a more comprehensive evaluation of the method's performance.

Q4: Could the author provide more details on hypothesis generation, the final module integration, and their relationship with the algorithm?
Certainly.

  • Hypothesis Generation: This process integrates information from the Question and Options, reformulating them into declarative propositions. Closed-choice questions often embed critical information in the Question, necessitating integration with the Options to form hypotheses.
  • Reasoning Process Judgment: This module operates after the Deductive Reasoner. Since each closed-choice question generates multiple hypotheses, errors in reasoning may lead to multiple plausible answers. This module evaluates the outputs of the Deductive Reasoner to select the final correct option.

Q5: Could the author provide finer-grained quantitative data, include experiments on PW depth, conduct error analysis, and analyze cases where the Z3 solver fails?
Thank you for the suggestion. We will include these additional experiments and analyses in future work.

If any concerns remain unresolved, please feel free to let us know. Thank you again for your time and patience.

Comment

Thank you for your response. The central claim of your paper, which proposes combining FOL and natural language to address information loss, has already been explored in existing research. As a result, the technical contribution of the work appears to be limited.

I agree with Reviewer FSUy’s point that, if the primary distinction of your paper lies in the hypothesis-deduction framework, the paper should be restructured to emphasize this as the core contribution. Additionally, the current experimental design does not robustly support this claim. To strengthen the paper, I recommend redesigning the experiments to more clearly demonstrate the value of the "hypothesis-deduction" framework.

Given these points, I will maintain my current rating.

Official Review
Rating: 5

The paper introduces LINA, a neuro-symbolic approach designed to enhance the logical reasoning abilities of LLMs. LINA implements a hypothetical-deductive reasoning paradigm by enabling LLMs to autonomously manage logical reasoning without external solvers. It extracts propositional logic from natural language, and performs deductive logical reasoning. Empirical results show LINA outperforms existing methods, including LINC and other prompting techniques.

Strengths

Improving LLM-based reasoning with neuro-symbolic integration is a good research problem. The writing is well-structured and clear. Empirical results are reported in detail. Code and data are provided for reproducibility.

Weaknesses

The core concept of the proposed approach is an agentic framework equipped with formal logic, which is relatively common. The advantages of translating natural language into formal logic and using LLMs for reasoning remain ambiguous. The effectiveness of the agentic framework is influenced by the capabilities of the base model and by potential self-bias. The application scope of the method is limited.

Questions

  1. The authors propose an agentic framework that utilizes formal logic to enhance LLMs. Could a broader comparison with other relevant approaches (in addition to LINC) [1-4] be considered to provide a more comprehensive evaluation?

  2. LLMs are generally stronger in processing natural language compared to formal logic. Could the authors clarify the advantages they see in converting logical reasoning tasks from natural language into Propositional or First-order Logic for LLM-based reasoning? If this conversion strategy offers benefits, might it be more effective to prompt LLMs with Chain-of-Thought reasoning including Propositional or First-order Logic?

  3. The authors introduce an agentic framework for symbolic reasoning without an external solver. Could they explain the rationale behind this choice in more detail? If the concern is that formal logic generated by LLMs may be unreliable for external solvers, how does the proposed framework address this issue? Additionally, since the agentic approach relies on a sufficiently capable base model for sub-task management, would this framework extend well to smaller models (such as 7-8B parameters)?

  4. Given that LLMs can struggle with self-bias [5], could the authors discuss any potential limitations in having the same LLM serve as both the deductive reasoner and supervisor/judge? Are there mechanisms in place to help mitigate self-bias and enhance the model's verification process?

  5. One challenge with deduction using formal logic can be the restricted scope, especially if the required deduction rules are not explicitly included as known information. Could the authors share any strategies to address this challenge? Additionally, do they see any potential for extending this formal logic framework to reasoning tasks that require broader expressiveness, such as math reasoning, coding, and question answering?

Reference:

[1] Pan, Liangming, et al. "Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning." The 2023 Conference on Empirical Methods in Natural Language Processing.

[2] Yang, Sen, et al. "Neuro-symbolic integration brings causal and reliable reasoning proofs." arXiv preprint arXiv:2311.09802 (2023).

[3] Xu, Fangzhi, et al. "Symbol-LLM: Towards foundational symbol-centric interface for large language models." arXiv preprint arXiv:2311.09278 (2023).

[4] Xu, Jundong, et al. "Faithful Logical Reasoning via Symbolic Chain-of-Thought." arXiv preprint arXiv:2405.18357 (2024).

[5] Huang, Jie, et al. "Large Language Models Cannot Self-Correct Reasoning Yet." The Twelfth International Conference on Learning Representations, 2024.

Comment

We thank you very much for your detailed and constructive review. We hope the following response could address your concerns.

Q1. Could the authors include a broader comparison with approaches beyond LINC (e.g., [1-4]) for a more comprehensive evaluation?
Thank you for your valuable suggestion. We have already compared LINA with LINC, SatLM, and various prompting-based methods. We will expand our comparison in the revised version. One notable advantage of LINA over [1-4] is its hybrid approach, which combines symbolic reasoning with natural language representations. This design preserves semantic richness while leveraging the benefits of formal logic.

Q2. Could the authors clarify the benefits of converting logical reasoning tasks into Propositional or First-Order Logic (FOL) for LLM reasoning? Would incorporating such logic into Chain-of-Thought (CoT) prompting be more effective?
Certainly. The primary advantage lies in separating reasoning from natural language, allowing LLMs to operate in the formal logic space using well-defined rules. This process ensures more reliable reasoning and facilitates error detection.
As shown in our ablation studies, symbolic representations combined with CoT already improve LLM performance. However, our proposed hypothesis-deduction mechanism further enhances reasoning ability, and its contribution should not be overlooked.
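For illustration only, the contrast between plain CoT and the FOL-augmented CoT prompting discussed above might look as follows (the wording is an assumption, not the paper's prompt templates):

```python
# Illustrative prompts only; not the paper's prompt templates.
plain_cot = (
    "Context: Every prime greater than 2 is odd. 17 is a prime greater than 2.\n"
    "Question: Is 17 odd?\n"
    "Think step by step."
)

fol_augmented_cot = (
    "Context (FOL): ∀x (PrimeGT2(x) → Odd(x)); PrimeGT2(17)\n"
    "Context (NL): (no residual natural-language information in this toy example)\n"
    "Question: Odd(17)?\n"
    "Reason step by step, naming the inference rule used at each step "
    "(e.g., universal instantiation, then modus ponens)."
)
```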

Q3. Why did the authors choose a symbolic reasoning framework without an external solver? How does the framework handle unreliable formal logic from LLMs, and can it work with smaller models (e.g., 7-8B parameters)?
We appreciate this insightful question. Our primary focus was addressing two issues related to external tools: information loss (L49) and poor generalization (L81). External solvers often require strict input formats, which may not fully capture the original semantics of natural language. By retaining critical natural language information alongside FOL expressions, our approach mitigates this problem while leveraging LLMs’ rule-based and semantic reasoning abilities.
To handle unreliable formal logic generation, we designed an information extraction module that reduces errors by preserving natural language alongside FOL. This strategy ensures that translation inaccuracies do not entirely compromise reasoning.
Regarding smaller models, we plan to conduct additional experiments to explore the framework's adaptability.

Q4. Since LLMs may struggle with self-bias, could the authors address limitations of using the same LLM as both reasoner and judge?
This is a valid concern. To mitigate self-bias, we introduced FOL representations to transform error-prone natural language reasoning into verifiable formal logic. Additionally, we incorporated a separate Reasoning Process Judgment module, built with a different LLM, to verify the correctness of the reasoning process.

Q5. Could the authors share strategies to address missing rules in formal logic deduction and discuss extending the framework to tasks like math reasoning, coding, or question answering?
As shown in our experiments, explicitly instructing LLMs to use FOL rules can yield promising results. This success may stem from the relative simplicity of FOL rules, which LLMs are already familiar with.
Regarding potential extensions, our framework’s hypothesis-deduction mechanism could adapt to other domains with structured reasoning tasks, such as math reasoning or coding. By integrating domain-specific reasoning rules, the framework could effectively tackle broader tasks.

If any concerns remain unresolved, please feel free to let us know. Thank you again for your time and patience.

Comment

Thank you for your response. I believe that additional experiments for comparison would strengthen the argument. Experimental results and analysis would be more compelling than plain explanations.

The introduction of external solvers is intended to provide a more reliable, albeit narrower, mechanism to support LLMs. From my understanding, if the rule application/program execution is simulated using prompting, we actually lose this benefit and reduce to another type of CoT reasoning, which is supposed to be worse due to insufficient training data.

Therefore, I maintain the score.

Official Review
Rating: 6

The authors propose LINA, a framework that decomposes the reasoning steps for complex questions using four main components: (1) an LLM-based logic extractor, (2) an LLM-based query extractor, (3) an LLM-powered logic deducer, and (4) a core algorithm that integrates context and derived results to analyze the correctness of the underlying answers. They also provide theoretical analysis of LINA’s properties and complexity. Experimental results demonstrate that LINA significantly improves performance on benchmarks requiring multi-step reasoning, outperforming existing methods like Chain-of-Thought (CoT) by a substantial margin.

Strengths

Originality: 4/5

A closely related work, SatLM, uses the Z3 solver as its logical reasoning backbone, whereas LINA leverages an LLM-prompt-based approach. While both frameworks share a similar conceptual foundation, LINA’s LLM-based reasoning backbone is more adaptable to loosely defined questions, enabling it to outperform the more rigid solver approach. This novel application of an LLM-driven deductive logic engine enhances generalizability.

Quality: 3.5/5

Pros: The authors provide both theoretical proofs on complexity and robust experimental results across multiple benchmarks. One question that arises is how well the LLM-powered deductive logic engine performs on standard logical deduction problems.

Cons: The reported accuracy for ReClorTeam (GPT-4-0613) on the ReClor leaderboard is 90.10, which is notably different from the numbers presented in this paper.

Clarity: 3.5/5

Pros: Figure 1 effectively clarifies the pipeline, and the appendix, which includes the actual prompts, further aids understanding.

Cons: Figure 2 is challenging to interpret without sufficient context, and it’s unclear why the Chain-of-Thought (CoT) approach does not explore additional steps.

Significance

This work is of interest to both the neuro-symbolic community and the NLP community.

Weaknesses

See Strengths above.

Questions

  1. How well does the LLM-powered deductive logic engine perform on standard logical deduction problems?
  2. The reported accuracy for ReClorTeam (GPT-4-0613) on the ReClor leaderboard is 90.10, which is notably different from the numbers presented in this paper. What may cause the difference?
Comment

We thank you very much for your constructive and insightful review. We hope the following response could address your concerns.

Q1. How well does the LLM-powered deductive logic engine perform on standard logical deduction problems?
Thank you for raising this question. As highlighted in our paper, we evaluated the engine on the RuleTaker dataset, a dataset of standard logical deduction problems generated using strict computer programs and logical specifications. Our approach demonstrates superior performance compared to the baselines.

Q2. Why does the reported accuracy for GPT-4-0613 on the ReClor leaderboard differ from the numbers in this paper?
Please refer to the Common section above for a detailed explanation.

If any concerns remain unresolved, please feel free to let us know. Thank you again for your time and patience.

Comment

The authors' rebuttal has addressed most of my concerns. I would like to maintain the rating.

Comment

Common Question

Why do the ReClor Team and some other papers report an accuracy of 90% for GPT-4 on ReClor, while the results in our paper show 71.33% for GPT-4o?
We appreciate the reviewer’s attention to this point. Our experimental setup for the ReClor dataset differs from that of the ReClor leaderboard. Specifically, our approach does not directly answer the multiple-choice questions. Instead, we convert the four options into four separate hypotheses and evaluate their correctness individually. Consequently, we applied the same processing methodology to the ReClor and LogiQA datasets for both Direct and CoT methods, where each option is judged separately. In cases where multiple options are deemed correct, an additional LLM round is used to select the final answer from these correct options.
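A schematic of this per-option evaluation protocol, under assumed details (`llm` is a placeholder model call; the prompts and tie-breaking round are illustrative, not the exact implementation):

```python
from collections import Counter

def per_option_protocol(item, llm):
    """Judge each option independently; if several are judged correct,
    one extra LLM round picks the final answer (assumed details)."""
    judged_correct = []
    for opt in item["options"]:
        verdict = llm(
            f"Context: {item['context']}\nQuestion: {item['question']}\n"
            f"Candidate answer: {opt}\nJudge this candidate on its own. Answer 'yes' or 'no'."
        )
        if verdict.strip().lower().startswith("yes"):
            judged_correct.append(opt)
    if len(judged_correct) == 1:
        final = judged_correct[0]
    else:
        pool = judged_correct or item["options"]
        final = llm(f"Question: {item['question']}\nChoose the single best answer from: {pool}")
    return final, len(judged_correct)

def tally_correct_counts(dataset, llm):
    """Tally how many options per question are judged correct (cf. the table below)."""
    return Counter(per_option_protocol(item, llm)[1] for item in dataset)
```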

This methodology aligns with work that aims to mitigate data contamination in LLMs (e.g., PertEval[1]). We believe the observed accuracy drop partly results from reducing data contamination from the LLM’s training process. To verify this, we conducted a new set of Standard experiments using GPT-4o on the ReClor dataset under our experimental setup. In this setup, each of the four options was evaluated for correctness, and instances of multiple correct options were also recorded. The results are summarized in the table below:

Number of Correct Options | Proportion
Single Option | 51.0%
Two Options | 25.2%
Three Options | 6.8%
Four Options | 1.6%
Total | 84.6%

From the 71.33% accuracy reported in our paper, we can see that GPT-4o successfully identifies all questions where only the correct option is considered valid. It also performs well on cases with two correct options but struggles with those involving three or four correct options. Even in this challenging setup, the overall accuracy for questions with at least one correct option identified reaches 84.6%. This indicates that some level of data contamination likely exists in the ReClor dataset for GPT-4o.

Therefore, we consider our experimental design, which mitigates data contamination, to be a fair and reasonable approach.

[1] Li J, Hu R, Huang K, et al. "PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations." The 38th Annual Conference on Neural Information Processing Systems.

AC Meta-Review

The reviewers all felt that the paper had a lot of positives, but that in its current form it also had a lot of issues. The general consensus was that the paper requires another round of rewriting. If this were a journal, the paper would be categorized as major revisions, but in the conference cycle it has to be reviewed completely again.

Additional Comments from Reviewer Discussion

The reviewers acknowledged that the authors gave good explanations for many of the points raised, but they felt that the core issue of the writing needs to be addressed.

Final Decision

Reject