A comprehensive evaluation of whether large language models can solve SAT problems, revealing that even advanced models struggle with basic logical reasoning despite their impressive capabilities in other domains.
In recent years, large language models (LLMs) have demonstrated remarkable progress across various domains. From crafting prose in the style of famous authors to engaging in seemingly human-like conversations, these systems have captured public imagination. Yet, a fundamental question remains: can LLMs actually reason? To investigate this, I conducted a series of experiments testing whether LLMs can solve SAT (Boolean satisfiability) problems—a domain where consistent logical reasoning is essential.
Why SAT Problems?
SAT problems provide an ideal testing ground for LLM reasoning capabilities for several reasons. First, they require applying a small set of logical rules consistently, regardless of problem size. Whether dealing with three variables or three thousand, the fundamental principles remain unchanged. Second, SAT problems can be generated randomly, reducing the likelihood that models are simply pattern-matching from their training data. Finally, the binary nature of SAT problems—either satisfiable or unsatisfiable—makes evaluation straightforward.
A SAT problem asks whether a given Boolean formula can be satisfied by some assignment of true/false values to its variables. For example, the formula (a || b) && !a is satisfiable because setting b to true and a to false makes the entire expression true. Conversely, b && !b is unsatisfiable because no assignment can make it true.
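For small formulas like these, satisfiability can be decided by brute force: enumerate every assignment and check each one. A minimal sketch (function names are mine, not from the experiments):

```python
from itertools import product

def brute_force_sat(formula, variables):
    """Try every true/false assignment; return a satisfying one or None."""
    for values in product([False, True], repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            return assignment
    return None

# (a || b) && !a  -- satisfiable
sat_example = lambda v: (v["a"] or v["b"]) and not v["a"]
print(brute_force_sat(sat_example, ["a", "b"]))  # {'a': False, 'b': True}

# b && !b  -- unsatisfiable: exhausts all assignments and returns None
unsat_example = lambda v: v["b"] and not v["b"]
print(brute_force_sat(unsat_example, ["b"]))  # None
```

This exhaustive check is exactly what becomes infeasible as variables grow: 10 variables means 1,024 assignments, but the search space doubles with each additional variable.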
Testing Methodology
I focused on formulas in Conjunctive Normal Form (CNF), where expressions consist of clauses connected by AND operators, with each clause containing variables connected by OR operators. This standardized format allows for easy validation using SAT solvers.
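A CNF formula has a natural machine-readable encoding, used by the DIMACS format that most SAT solvers accept: each clause is a list of signed integers, where 3 means x3 and -3 means NOT x3. A sketch of that representation and the check that an assignment satisfies it:

```python
# (x1 OR NOT x2) AND (x2 OR x3) in DIMACS-style literals:
cnf = [[1, -2], [2, 3]]

def evaluate_cnf(cnf, assignment):
    """assignment maps variable index -> bool.
    The formula is true iff every clause contains at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

print(evaluate_cnf(cnf, {1: True, 2: False, 3: True}))  # True
```

Because validation is just this linear scan over clauses, checking a claimed satisfying assignment is trivial even when finding one is hard, which is what makes grading the models' SAT answers straightforward.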
I generated three types of problems:
- SAT problems with 10 variables and 200 clauses
- SAT problems with 14 variables and 126 clauses
- UNSAT problems with 10 variables and 200 clauses
The 4-SAT variant (where each clause contains exactly four literals) was chosen because random 4-SAT instances near the satisfiability threshold (a clause-to-variable ratio of roughly 10) are among the hardest to solve while being satisfiable or unsatisfiable with roughly equal probability.
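Generating such instances is simple: for each clause, pick k distinct variables and negate each with probability one half. A sketch of a generator under those assumptions (the author's exact generation procedure is not specified):

```python
import random

def random_ksat(num_vars, num_clauses, k=4, seed=0):
    """Random k-SAT instance: each clause has k distinct variables,
    each negated with probability 1/2, as DIMACS-style signed literals."""
    rng = random.Random(seed)
    cnf = []
    for _ in range(num_clauses):
        variables = rng.sample(range(1, num_vars + 1), k)
        clause = [v if rng.random() < 0.5 else -v for v in variables]
        cnf.append(clause)
    return cnf

instance = random_ksat(num_vars=10, num_clauses=200)
print(len(instance), len(instance[0]))  # 200 4
```

Instances generated this way are then run through a conventional SAT solver to label them SAT or UNSAT before being posed to the models.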
For each problem, I used the same system prompt, instructing the model to determine satisfiability without external tools, provide its reasoning, and output the result in a structured JSON format containing the satisfiability determination and, if applicable, a variable assignment.
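Grading a reply then amounts to parsing the JSON and checking any claimed assignment against every clause. A sketch of that check; the field names ("satisfiable", "assignment") are illustrative, since the article does not give the exact schema:

```python
import json

def check_model_answer(cnf, response_text):
    """Verify a model's claimed satisfying assignment against a DIMACS-style
    clause list. Returns True/False for a checked assignment, or None when
    the model claims UNSAT (proving UNSAT requires a solver, not a checker)."""
    reply = json.loads(response_text)
    if not reply["satisfiable"]:
        return None
    assignment = {int(var): bool(val) for var, val in reply["assignment"].items()}
    return all(
        any(assignment.get(abs(lit), False) == (lit > 0) for lit in clause)
        for clause in cnf
    )

cnf = [[1, -2], [2, 3]]
reply = '{"satisfiable": true, "assignment": {"1": true, "2": false, "3": true}}'
print(check_model_answer(cnf, reply))  # True
```

Note the asymmetry this exposes: a SAT claim is verifiable in linear time, but an UNSAT claim can only be trusted if an independent solver confirms it, which is why the fabricated assignments described below are easy to catch.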
Tested Models
I evaluated three models through OpenRouter.ai:
- Gemini 3 Pro: A flagship model expected to demonstrate strong reasoning
- GPT-4.2 Mini: A smaller, more efficient model
- GPT-4.2: The most capable model, though most expensive to test
All models had reasoning enabled with high effort settings. Due to cost constraints, I tested approximately five instances per model and problem type rather than the initially planned ten.
Results
Gemini 3 Pro
Despite being a flagship model, Gemini 3 Pro performed poorly. It failed to produce correct results in any tested instance. For SAT problems with 10 variables and 200 clauses, it typically claimed satisfiability but provided invalid assignments. In one case, it incorrectly labeled a satisfiable formula as unsatisfiable. For unsatisfiable problems, it consistently claimed satisfiability and fabricated assignments.
GPT-4.2 Mini
Surprisingly, the smaller GPT-4.2 Mini outperformed Gemini 3 Pro. It successfully found valid assignments for some SAT problems, though it struggled with unsatisfiable formulas, often fabricating assignments. When faced with particularly difficult SAT problems (200 clauses), it sometimes correctly identified that finding a satisfying assignment would be computationally prohibitive, which represents sound reasoning about performance limitations rather than logical failure.
GPT-4.2
The most capable model showed the best performance. It correctly solved all SAT problems with 10 variables and 200 clauses, finding valid assignments each time. When tested with more complex problems (14 variables, 126 clauses), it achieved a 50% success rate across four instances—essentially random guessing at this difficulty level. Like the other models, it struggled with unsatisfiable problems, consistently fabricating assignments.
Analysis and Implications
These results align with recent research showing that even state-of-the-art models degrade to random guessing on sufficiently complex SAT problems. The consistent failure across models to correctly identify unsatisfiable formulas is particularly telling—it suggests that LLMs lack the ability to perform exhaustive logical reasoning required to prove that no satisfying assignment exists.
Several factors may contribute to these limitations. As problem size increases, the context window becomes overwhelmed with clauses, making it difficult for models to track all constraints simultaneously. This mirrors challenges in software development, where complex systems with many interdependent rules become difficult to reason about comprehensively.
Conclusion
The evidence strongly suggests that current LLMs cannot consistently reason about SAT problems, despite their impressive capabilities in other domains. While they can sometimes find satisfying assignments for simpler problems, they fundamentally lack the ability to perform the exhaustive logical reasoning that SAT solving requires.
This limitation has important implications for practical applications. LLMs remain valuable tools even without robust reasoning capabilities, but we cannot simply encode rules and expect perfect compliance. For critical requirements, additional verification processes remain necessary. The gap between pattern recognition and genuine logical reasoning persists, suggesting that current architectures may need fundamental rethinking to achieve true reasoning capabilities.
The question "Can LLMs SAT?" appears to have a clear answer: not yet, and perhaps not with current approaches. As we continue to scale these models, we may need to reconsider what we mean by "reasoning" and whether different architectural approaches are needed to achieve it.