The Intent Gap: How claimcheck Bridges Formal Verification and Programmer Intent

Tech Essays Reporter

As AI-generated code becomes prevalent, claimcheck introduces round-trip informalization to verify that mathematically proven specifications actually match human intent.

The fundamental shift in programming paradigms brought about by large language models represents more than a change in tooling; it signals a redistribution of cognitive labor between human and machine. Where programmers once meticulously crafted implementation details, they now increasingly articulate behavioral requirements in natural language, delegating the translation into formal specifications to AI agents. This transition creates a critical verification gap: formal methods can mathematically guarantee that code satisfies its specifications, but they offer no assurance that those specifications accurately capture the programmer's original intent. Our work on claimcheck addresses precisely this discontinuity through a technique called round-trip informalization, establishing a probabilistic bridge between machine-verified correctness and human-meaningful guarantees.

At its core, claimcheck operates through a deliberately decoupled two-phase process designed to surface discrepancies between formal lemmas and natural language requirements. In the first phase, an informalization LLM (defaulting to Claude Haiku at a low temperature setting) receives only the formal Dafny lemma, with no context about the original requirement. Its sole task is to translate the mathematical assertion back into plain English, effectively reverse-engineering the specification's semantic meaning from its formal representation. This structural isolation proves crucial: by withholding the original requirement, we eliminate anchoring bias and force a genuine interpretation of what the code actually guarantees. In the second phase, a separate comparison LLM (typically Claude Sonnet 4.5) analyzes both the original requirement and the back-translated description, judging whether they express equivalent guarantees. Each outcome is classified as a confirmed claim (semantic match) or a disputed claim (a weakened, strengthened, or divergent interpretation). Practical implementations batch these operations, so checking a whole file costs just two API calls, as in the sketch below.
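
To make the pipeline concrete, here is a minimal Python sketch of the two blinded phases using the Anthropic SDK. The model ids, prompt wording, and function names are illustrative assumptions, not claimcheck's actual implementation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder ids: substitute the actual identifiers for Claude Haiku
# and Claude Sonnet 4.5 available to you.
INFORMALIZE_MODEL = "claude-haiku-model-id"
COMPARE_MODEL = "claude-sonnet-model-id"


def informalize(lemmas: list[str]) -> str:
    """Phase 1: back-translate formal Dafny lemmas into plain English.

    Deliberately receives ONLY the lemmas, never the original
    requirements, so the model cannot anchor on what the spec was
    supposed to say.
    """
    prompt = (
        "Translate each of the following Dafny lemmas into plain English, "
        "stating exactly what each one guarantees:\n\n" + "\n\n".join(lemmas)
    )
    response = client.messages.create(
        model=INFORMALIZE_MODEL,
        max_tokens=2000,
        temperature=0.0,  # low temperature for stable back-translations
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def compare(requirements: list[str], back_translations: str) -> str:
    """Phase 2: judge requirement vs. back-translation, pair by pair.

    Classifies each pair as confirmed (semantic match) or disputed
    (weakened, strengthened, or divergent).
    """
    prompt = (
        "For each requirement below, decide whether the corresponding "
        "back-translated lemma expresses an equivalent guarantee. Label "
        "each pair confirmed, or disputed (weakened / strengthened / "
        "divergent), with a one-sentence justification.\n\n"
        "Requirements:\n" + "\n".join(requirements) +
        "\n\nBack-translations:\n" + back_translations
    )
    response = client.messages.create(
        model=COMPARE_MODEL,
        max_tokens=2000,
        temperature=0.0,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


# Batching every lemma in a file into each phase keeps the total cost
# to two API calls per file.
```

Keeping the two phases in separate calls, with each model blinded to half of the information, is the point of the design: the informalizer cannot parrot the requirement, and the comparator cannot rationalize from the raw Dafny.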

Validation testing revealed claimcheck's effectiveness at catching subtle specification drift that evades conventional review. When applied to an election tally system implementing 11 lemmas, including one deliberately weakened lemma whose requirement stated that adding a ballot could never decrease a tally while the lemma merely proved that counts stay non-negative, the tool correctly disputed both the planted error and an unexpected scope reduction. The latter case involved a lemma proving order invariance only for prepend/append operations while the requirement claimed invariance under any permutation, a distinction invisible to formal verification but semantically significant. After corrections, claimcheck confirmed all specifications with zero disputes. Across broader testing spanning five domains (counter, kanban, colorwheel, canon, delegation-auth) and 36 requirement-lemma pairs, including eight intentionally flawed lemmas, the two-pass architecture achieved 96.3% accuracy across 108 comparisons. This substantially outperformed single-prompt approaches (86.1%) and general-purpose Claude Code pipelines (69.4%), while batch processing delivered a 3× speedup over individual API calls.
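
To see why the weakened ballot lemma matters, consider this hypothetical Python rendering of the two guarantees. The names add_ballot, before, and after are illustrative; the actual system was specified in Dafny, not Python.

```python
def add_ballot(tally: dict[str, int], candidate: str) -> dict[str, int]:
    """Record one ballot for a candidate, leaving other counts untouched."""
    updated = dict(tally)
    updated[candidate] = updated.get(candidate, 0) + 1
    return updated


before = {"alice": 3, "bob": 5}
after = add_ballot(before, "alice")

# What the weakened lemma actually proved: counts remain non-negative.
assert all(count >= 0 for count in after.values())

# What the requirement demanded: no candidate's tally ever decreases.
# The non-negativity property would hold even for buggy code that
# zeroed out a tally, which is exactly why claimcheck disputed the claim.
assert all(after[c] >= before[c] for c in before)
```

Both assertions pass here, but only the second captures the requirement; a verifier that proves the first has proven something strictly weaker, and nothing in the formal toolchain alone would flag that.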

The implications extend beyond immediate utility in AI-assisted verification workflows. claimcheck represents an intermediate layer in the larger challenge of mapping natural language requirements to formal guarantees, providing rapid surface-level validation where exhaustive verification remains computationally impractical. Its structural separation principle, using distinct models blinded to complementary information, proves more reliable than integrated approaches, suggesting broader applicability in formal-informal translation tasks. Nevertheless, we acknowledge several inherent limitations: the probabilistic nature introduces an error rate of roughly 3.7% in our testing, which is inadequate for high-stakes systems without human review; informalization accuracy depends on lemma self-containment, so context-dependent proofs may fare worse; and coverage completeness (requirements without lemmas, or lemmas without requirements) remains unverified. These constraints don't invalidate the approach, but they delineate the research frontier as we evolve toward systems capable of full intent alignment.

Viewing claimcheck as a probabilistic sentinel rather than a proof mechanism reframes its role in the verification ecosystem. It operates at the intersection of formal methods and LLM-based semantic analysis, creating feedback loops that help programmers detect when machine-verified correctness diverges from intended meaning. As AI continues to reshape programming workflows, tools that narrow this intent gap will prove essential for maintaining semantic coherence between human thought and machine execution. The GitHub repository provides implementation details and test cases for further exploration.
