Skip to content

Formal Verification: The Proof Oracles Your AI Agent Doesn't Know It Needs

7 min read

Every conversation about AI-generated code ends the same way: “But how do we trust it?”

The answer is always more tests. But tests are fundamentally sampling. You pick inputs, check outputs, and hope the gaps don’t hide something catastrophic. For an AI agent making 40+ tool calls per session, generating financial logic at 3am while you sleep? Sampling isn’t enough.

Formal verification tools don’t sample. They prove. They encode your program into logic and either guarantee a property always holds, or hand you a concrete counterexample showing exactly how it breaks.

One Example That Explains Everything

Your AI agent writes a late fee calculator:

def late_fee(days_overdue: int, balance: float) -> float:
    if days_overdue <= 0:
        return 0.0
    fee = balance * 0.02 * days_overdue
    return min(fee, balance * 0.25)  # cap at 25% of balance

A test checks three inputs. A formal verifier asks: “Is there any input where the fee exceeds 25% of balance?” It explores the entire input space mathematically. If it finds one, you get the exact values. If not, you get a proof that none exist.

ApproachWhat you learn
Unit test”These 3 inputs work”
Property test”1000 random inputs didn’t break it” (probabilistic)
Formal verification”No violating input exists” (proven) or “Here’s the exact one that breaks it”

That’s the end goal. Not “probably correct.” Proven correct.

What Traditional Tests Already Miss

This isn’t just an AI problem. Traditional software has gaps that testing can’t close:

  • Concurrency: You can’t test all thread interleavings. The race condition that double-debits shows up at 3am in production, never in CI (Continuous Integration).
  • State machines: A workflow with 8 states and 15 transitions has hundreds of paths. Tests cover the happy path and maybe 3 error cases.
  • Permissions: “Tenant A can’t access Tenant B” is easy to test for cases you think of. Breaches come from the cases you don’t.
  • Invariants over time: “The ledger balances” is trivial for one transaction. It’s the sequence of 10,000 with retries, timeouts, and crash recovery that breaks it.

Amazon adopted TLA+ (Temporal Logic of Actions) for exactly these problems. Their team found bugs in DynamoDB, S3, and EBS that no amount of testing or code review had surfaced, in designs running in production for years.

The Tool Families (Quick Reference)

You don’t need to learn all of these. Pick the one that matches your pain:

If your problem is…Use thisWhat it doesExample
”Can this input break my function?”Z3 (SMT solver, Satisfiability Modulo Theories)Finds constraint violations or proves none exist”Can transfer() produce a negative balance?"
"Does my algorithm satisfy its spec?”Dafny (auto-active verifier)Proves annotated code correct for all inputsBinary search never goes out of bounds
”Can my system reach a bad state?”TLA+ (model checker)Explores all interleavings of a concurrent systemSettlement pipeline can’t double-debit
”Is there a loophole in my schema?”Alloy (relational modeler)Finds counterexamples in data modelsShared role bridges two tenants
”Does my code have null/memory bugs?”Infer (static analyzer)Proves absence of specific bug classesNo null deref on the timeout path
”Can transactions break my contract?”Certora (smart contract verifier for DeFi / Decentralized Finance)Proves on-chain invariants holdNo transaction sequence drains the pool
”I need absolute mathematical proof”Lean / Coq (proof assistants)Machine-checked proofsVerified compiler, verified kernel

Where It Sits in the Quality Stack

Formal verification doesn’t replace testing. It occupies a different point on the cost-confidence spectrum:

Linting           ░░░░░░░░░░  catches style issues
Type checking     ██░░░░░░░░  catches type errors
Unit tests        ████░░░░░░  catches regressions (sampled)
Property tests    ██████░░░░  catches edge cases (random)
Static analysis   ████████░░  PROVES absence of bug classes
Formal verify     ██████████  PROVES arbitrary properties

Use each layer for what it’s good at. Don’t reach for formal verification when a type checker would do. But don’t rely on tests alone when the bug costs real money.

If the bug costs…Tests are enoughConsider formal verification
Developer timeYesNo
Customer trust (SLA / Service Level Agreement breach)ProbablyProperty tests + static analysis
Real money (double charges)NoModel check the protocol
Legal liability (regulatory)NoVerify the core logic
Catastrophic (safety-critical)Definitely noProof assistant

Where It Gets Expensive and Complicated

I won’t pretend this is free. Three honest costs:

1. You have to write specs. Every tool requires you to state what you’re proving. The spec for a sorting function is simple (output is ordered, is a permutation of input). The spec for “correct payment processing” involves 47 business rules that change quarterly. If your spec is wrong, your proof is misleading.

2. There’s always an abstraction gap. TLA+ verifies your model, not your code. Dafny verifies Dafny code, not your Python. Infer verifies your actual code but only checks mechanical properties (null safety, memory safety), not business logic. Nothing verifies arbitrary Python logic exhaustively today.

3. State space explosion. Model checkers explore all states. A system with 3 integers (0-1000) has a billion. Add concurrency and you’re at trillions. You’re always trading completeness for tractability.

What It Costs to Run

The good news: compute is nearly free. Most tools are open-source and run on a laptop.

ToolLicenseCI timeReal cost
Z3, Dafny, AlloyMIT (free)5-60 seconds$0 compute
TLA+ / TLC (TLA+ model checker)MIT (free)10s to hours (state-space dependent)Benefits from 16+ cores
InferMIT (free)2-15 min for large codebases$0 compute
CertoraCommercial5-30 min (cloud)Paid SaaS (Software as a Service)
Lean / CoqFreeSeconds per step$0 compute

The expensive part is human time: a verification engineer costs $150-250/hr, and writing a TLA+ spec for a payment protocol takes 2-5 days.

Why AI Agents Change the Economics

Here’s the shift. Three things that made formal methods impractical for most teams are exactly what LLMs (Large Language Models) are good at:

Writing first-draft specs. The hardest part was always “state what you want to prove.” Give Claude a function and say “write a Dafny spec.” The first draft is usually 80% correct. That collapses days of specialist work into minutes. Cost: $0.50-5.00 in tokens.

Parsing counterexamples. When the verifier says “FAIL: counterexample at x=5, y=-1,” the agent extracts that, understands why it breaks the invariant, and fixes the code. This is dramatically richer feedback than a test saying “expected 10, got 11.”

Iterating for free. A human iterating on a proof costs $200/hr. An agent iterating costs tokens. The verify-fix-reverify loop that makes formal methods tedious for humans is exactly what agents are built for.

The integration is the same tool-use pattern your agent already uses. The verifier CLI (Command Line Interface) is just another tool call:

Agent generates code + spec
  → calls verifier CLI (tool use)
  → PASS? done, proven correct
  → FAIL? parse counterexample, fix, re-verify

This is harness engineering applied to correctness. The verifier is just another tool in the environment, like the evaluator agent is a tool for quality. Except this one returns mathematical proof instead of a vibes-based judgment.

Getting Started (Under 1 Hour)

ToolInstallFirst experiment
Z3pip install z3-solverProve a function can’t return negative values
DafnyVS Code extensionVerify a sorting function with pre/postconditions
TLA+TLA+ ToolboxModel-check a simple mutex
Alloyalloytools.orgFind a counterexample in a 3-entity access model
Inferbrew install inferRun on existing C/Java code, read the report

Pick the one that matches where you’re losing money. Don’t boil the ocean.

The Bottom Line

Formal verification has been “almost ready for mainstream” for 30 years. Too hard to use. Specs too hard to write. Iteration too slow.

AI agents break that stalemate. They write specs at token cost, parse counterexamples natively, and iterate without burning engineer-hours. The verifier becomes just another layer in your agent’s stack, the same way every other hard problem got solved: by wrapping it in a tool call.

You don’t need to verify everything. Most code doesn’t need proof. But the payment calculation that runs 2M times a day? The permission model guarding PHI (Protected Health Information)? The settlement state machine that can’t double-debit? Those deserve better than “these 3 inputs worked.”


Exploring formal verification for AI-generated code? I’d love to hear what properties you’re trying to prove. Reach out on LinkedIn.