
Theorem Proving

AI-assisted formal theorem proving in Lean, Coq/Rocq, and Isabelle: DeepSeek-Prover, Lean Copilot, and the AlphaProof lineage.

Capability notes

AI-assisted theorem proving in 2026 operates primarily through **Lean 4** — the dominant interactive theorem prover with the largest open-source math library (mathlib4, 1.5M+ lines of formalized mathematics). **Coq/Rocq** has a deeper formalization history but weaker LLM tooling. The AI integration story centers on lean-copilot, a VS Code extension connecting proof context to LLM backends.

**What LLMs can do.** The capability ceiling is **proof completion**: given a theorem statement and a partial proof skeleton, an LLM fills in the remaining tactics. On standard library proofs (algebra, number theory, linear algebra), [DeepSeek V3](/models/deepseek-v3) and [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) achieve 40-55% completion rates on LeanDojo, with correct proofs requiring 2-4 LLM attempts per lemma. Completion drops to 15-25% on novel proofs requiring synthesis of multiple mathlib lemmas without local syntactic overlap.

**What LLMs cannot do.** Autonomous proof generation from plain-English statements fails more than 90% of the time. LLMs cannot detect circular reasoning — generated proofs that assume the theorem being proved will type-check but are logically vacuous. They struggle with dependent type manipulations, universe level constraints, and termination proofs for recursive functions — these are architecture-agnostic failures, not model-scale limitations. When multiple valid proofs exist, LLMs systematically generate the longer, more fragile path.

**lean-copilot** feeds proof context (theorem statement, hypotheses, goals, open namespaces) from the Lean 4 LSP server to an LLM backend. Two modes: auto-complete (fills the next tactic, 50-60% acceptance) and full-proof (attempts to close all goals, 15-25% acceptance). Full-proof mode is an exploration tool, not a trusted proof generator.

**Landscape.** Lean 4 is winning — mathlib4 is the fastest-growing formal math library. Coq/Rocq has stronger extraction-to-code capabilities (verified algorithms to OCaml/Haskell) but sparse AI tooling. Isabelle and HOL have niche formalization communities with negligible AI integration. For AI-assisted proving, Lean 4 is the practical choice.
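To make the completion workflow concrete, here is a minimal sketch of the interaction, using invented theorem names and one real mathlib lemma (`mul_self_nonneg`); the tactic choice is illustrative, not drawn from lean-copilot's internals.

```lean
import Mathlib

-- What the human writes: a statement plus a skeleton with a `sorry` hole.
theorem sq_nonneg' (a : ℤ) : 0 ≤ a * a := by
  sorry

-- What a completion-capable model returns: the hole replaced by a concrete
-- tactic. The kernel then type-checks the result; acceptance is binary.
theorem sq_nonneg'' (a : ℤ) : 0 ≤ a * a := by
  exact mul_self_nonneg a
```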

If you just want to try this

Lowest-friction path to a working setup.

Install Lean 4 and lean-copilot for VS Code. Budget 2-3 weeks to learn Lean syntax through the Natural Number Game and "Theorem Proving in Lean 4" chapters 1-5 before AI assistance becomes productive.

1. Install Lean 4 via `elan` (Lean's version manager). On macOS/Linux: `curl https://raw.githubusercontent.com/leanprover/elan/master/elan-init.sh -sSf | sh`. On Windows, use the official installer. Verify with `lean --version` — you need 4.7+.
2. Install VS Code with the `lean4` extension for syntax highlighting and LSP support. Open a `.lean` file; the infoview panel shows proof goals.
3. Install lean-copilot from the VS Code marketplace. Configure the backend: [vLLM](/tools/vllm) for local inference, or any OpenAI-compatible API. For local inference, serve [DeepSeek V3](/models/deepseek-v3) or [Llama 3.3 70B](/models/llama-3-3-70b) via vLLM (a quick endpoint check is sketched after this list). Model quality matters decisively — 7B and 13B models cannot produce useful Lean proofs; 70B+ is the practical floor.
4. Write a theorem statement in Lean syntax. lean-copilot reads the goal state from the LSP. Invoke "Generate proof." The LLM returns a proof block; Lean's kernel type-checks it. If it fails, lean-copilot retries with the error as feedback.

What you get: an interactive assistant where the LLM suggests tactics and the human evaluates correctness, logical coherence, and proof quality. The human-in-the-loop is essential — you must verify the LLM proved the intended theorem, not a modified version that happens to type-check. Critical: you need a 70B+ model served locally or API access to a frontier model. lean-copilot with a 7B model generates syntactically valid Lean that type-checks but proves vacuous or incorrect statements.
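Before wiring lean-copilot to the endpoint, confirm the served model answers at all. A minimal sketch, assuming a vLLM server on its default port and the official `openai` Python client (`pip install openai`); the model name must match whatever you passed to vLLM.

```python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; it ignores the API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must match the served model
    messages=[{
        "role": "user",
        "content": "Complete this Lean 4 proof. Reply with Lean code only.\n"
                   "theorem t (a b : Nat) : a + b = b + a := by\n",
    }],
    max_tokens=128,
    temperature=0.2,  # proof completion rewards near-deterministic sampling
)
print(resp.choices[0].message.content)
```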

For production deployment

Operator-grade recommendation.

For operators evaluating AI theorem proving in a verification pipeline, the central question: does AI assistance provide net productivity gain or net verification overhead?

**When it's productive.** Taxonomic proofs (proving a new type class instance satisfies its parent's axioms) follow rigid structural patterns that LLMs handle at 60-80% success. Lemma variants of existing mathlib lemmas. Mechanical steps: `simp` chains, `ring` simplifications, `linarith` arithmetic — tedious for humans, completed by LLMs at 60-80%. A human formalizer who would spend 30 minutes on mechanical lemma plumbing instead spends 5 minutes prompting and 10 minutes reviewing. The productivity gain is real for these categories.

**When it's counterproductive.** Novel research-level proofs — LLMs cannot generate proofs for theorems absent from training data. They hallucinate plausible Lean code that type-checks but proves a weaker or different statement. Complex termination proofs — LLMs systematically fail at proving recursive termination because they cannot represent recursion structure. Universe-polymorphic reasoning and category-theoretic constructions push against both Lean's inference and LLM capabilities.

**Verification of generated proofs.** A Lean proof that type-checks is mathematically correct — the kernel guarantees it. The failure: the LLM rewrites the goal into something it can prove, proves that, and presents it as the original. Lean's kernel cannot detect this because the redefinition is valid Lean — it just doesn't correspond to the intended theorem. The human's essential role: verify the *proven statement* is semantically identical to the *intended theorem*. This is the most important step in an AI-augmented proof pipeline.

**Pipeline architecture.** lean-copilot + [vLLM](/tools/vllm) serving a 70B+ model. For batch automation: create a Lean file with multiple lemma statements, run lean-copilot on each, collect type-checked proofs, and flag proofs needing human semantic review (a minimal sketch follows below). For interactive development: human and LLM collaborate, with the LLM suggesting tactics and the human steering proof structure.

**Model selection.** [DeepSeek V3](/models/deepseek-v3) leads on LeanDojo (~55% completion). [Llama 3.3 70B](/models/llama-3-3-70b) is competitive (~48%) and runs on a single [RTX 4090](/hardware/rtx-4090) at Q4 with partial offload. 32B models are not useful — completion drops below 20% and error rates exceed 50% even on simple lemma plumbing. Frontier closed-source models (Claude 3.7, GPT-5) perform well but require cloud API access, conflicting with some verification security requirements.
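A minimal sketch of the batch-automation loop, assuming a Lake project with mathlib already built in the working directory. `candidate_proofs` stands in for whatever your LLM backend returned; `lake env lean <file>` is the standard way to type-check one file against the project's dependencies.

```python
import subprocess
import tempfile
from pathlib import Path

# lemma name -> full Lean source produced by the LLM (illustrative content)
candidate_proofs = {
    "add_comm_nat": "theorem add_comm_nat (a b : Nat) : a + b = b + a := by omega\n",
}

needs_semantic_review = []
for name, source in candidate_proofs.items():
    path = Path(tempfile.mkdtemp()) / f"{name}.lean"
    path.write_text(source)
    # Kernel check: exit code 0 means the proof type-checks. It does NOT mean
    # the intended theorem was proved -- that judgment stays with a human.
    result = subprocess.run(
        ["lake", "env", "lean", str(path)],
        capture_output=True, text=True, timeout=300,
    )
    if result.returncode == 0:
        needs_semantic_review.append(name)
    else:
        print(f"{name}: rejected by kernel\n{result.stderr or result.stdout}")

print("type-checked, pending human statement review:", needs_semantic_review)
```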

What breaks

Failure modes operators see in the wild.

- **Proof that type-checks but proves the wrong theorem.** The LLM subtly modifies the goal statement — adding an unnecessary hypothesis, weakening the conclusion, or specializing the type. Lean accepts it because the modified statement is valid. Symptom: the proof compiles, but a human reviewer discovers the proven theorem is a trivial corollary, not the target. Mitigation: inspect the exact goal statement the LLM proved. Implement a goal-diff check in lean-copilot: compare initial and post-tactic goals; flag changes. Never auto-merge generated proofs without human statement verification. (A minimal Lean illustration follows this list.)
- **Hallucinated lemmas that don't exist in mathlib.** The LLM invents lemma names following mathlib conventions that reference non-existent theorems. Symptom: Lean reports "unknown identifier" errors. Mitigation: use `#check` to verify lemma existence before accepting LLM suggestions. Search the mathlib index (loogle.lean-lang.org) for each suggested lemma. Approximately 20-30% of LLM-suggested lemma names are hallucinations.
- **Infinite proof search loops.** On hard theorems, the LLM generates a tactic, Lean rejects it, the error feeds back, and the LLM tries again — looping indefinitely. Symptom: GPU utilization at 100% for minutes with no proof progress. Mitigation: cap at 10 LLM attempts per goal. If the LLM suggests the same tactic 3 times consecutively, terminate — the theorem exceeds LLM capability.
- **Lean 3 and Lean 4 syntax confusion.** Training data includes both versions. The LLM mixes syntax: Lean 3 `begin...end` blocks in Lean 4 files, or Lean 3's bare `rw lemma` where Lean 4 requires `rw [lemma]`. Symptom: compile errors that waste time. Mitigation: use models trained predominantly on Lean 4 corpora (DeepSeek V3, Llama 3.3). This failure mode is annoying but not dangerous — it produces errors, not incorrect proofs.
- **Informal-to-formal translation failure.** The LLM generates a correct natural-language proof but Lean code implementing different logic. Symptom: the human reads the text, agrees it's right, then discovers the code proves something else. Mitigation: never trust natural-language explanations. Only the type-checked Lean code is the proof. Review Lean code directly; treat natural-language output as commentary.
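A minimal Lean illustration of the first failure mode, using toy statements: both theorems type-check, but the second quietly adds a hypothesis that makes it strictly weaker than what was asked for. Only a statement diff catches it. `#check` — the mitigation for hallucinated names — is shown alongside.

```lean
-- Intended statement: holds for every n.
theorem intended (n : Nat) : n ≤ n + 1 := by
  exact Nat.le_succ n

-- A silently modified statement: the extra hypothesis `h : n = 0` reduces
-- this to a trivial special case, yet the kernel accepts it without complaint.
theorem weakened (n : Nat) (h : n = 0) : n ≤ n + 1 := by
  subst h; decide

-- Mitigation for hallucinated lemma names: ask the elaborator directly.
#check @Nat.le_succ          -- exists; prints its type
-- #check @Nat.le_succ_mul   -- a plausible-looking (invented) name; if Lean
--                           -- reports "unknown identifier", reject the suggestion
```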

Hardware guidance

**Hobbyist: consumer GPU with 16GB+ VRAM.** [RTX 4070 Ti 16GB](/hardware/rtx-4070-ti) or [RTX 4080 Super 16GB](/hardware/rtx-4080-super) runs Llama 3.3 70B at Q4 with partial offload at 10-15 tok/s. Proof suggestion latency: 3-8 seconds — acceptable for interactive use. A [MacBook Pro 16 M4 Max](/hardware/macbook-pro-16-m4-max) with 64GB unified memory runs the same model at 15-20 tok/s. 12GB GPUs cannot fit 70B models for useful Lean proof generation.

**SMB: 2-4 person formalization team.** One [RTX 4090](/hardware/rtx-4090) (24GB) serving Llama 3.3 70B Q4 via [vLLM](/tools/vllm) to 2-4 VS Code instances at 20-30 tok/s shared. vLLM continuous batching multiplexes requests. Cost: ~$1,800 GPU + ~$500 system = ~$2,300 one-time. For teams formalizing textbooks or verifying cryptographic protocols, this is dramatically cheaper than paying API costs per proof step.

**Enterprise: formal verification lab.** [NVIDIA L40S](/hardware/nvidia-l40s) (48GB) or [RTX 6000 Ada](/hardware/rtx-6000-ada) (48GB) serves DeepSeek V3 at FP8 or multiple concurrent Llama 3.3 instances, handling 5-10 formalizers at sub-5-second latency. Colocate the GPU with the team for sub-10ms network latency — cross-continent latency adds 100-200ms per round-trip, compounding across proof steps. Air-gapped deployment satisfies defense and fintech verification requirements.

**Frontier: dedicated verification cluster.** [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) (80GB) serves DeepSeek V3 at FP16 with 2 TB/s bandwidth, reducing latency to 1-2 seconds. A 4x H100 cluster serves 20+ formalizers. Justified only for critical infrastructure verification (compiler correctness, OS kernel properties, cryptographic protocol security) where a verification error has regulatory or safety consequences.

Runtime guidance

**Individual Lean user with a 70B-capable GPU? → lean-copilot + vLLM.** lean-copilot communicates with any OpenAI-compatible API. Point it at a local [vLLM](/tools/vllm) instance serving [DeepSeek V3](/models/deepseek-v3) or [Llama 3.3 70B](/models/llama-3-3-70b). vLLM is preferred over [llama.cpp](/tools/llama-cpp) — continuous batching reduces latency for bursty proof queries, while llama.cpp's sequential batching adds 1-3 seconds per query. Configure via VS Code settings: API endpoint to `localhost:8000/v1`, model name to your served model. Enable full-proof mode sparingly.

**Team Lean AI deployment? → vLLM + shared GPU server.** One GPU running vLLM serves 2-10 lean-copilot clients concurrently. Set `max-model-len` to the model's full context window (32K for Llama 3.3, 128K for DeepSeek V3) — proofs pull in large mathlib contexts. Provision 5-8GB VRAM per concurrent user for KV cache (a sizing sketch follows this section).

**Air-gapped or classified environment? → air-gapped vLLM + local weights.** vLLM runs offline. Download model weights once into the enclave, serve via vLLM, and connect classified VS Code workstations. Mirror the Lean 4 toolchain (kernel, mathlib cache) inside the enclave so `lake` fetches from a local mirror. This eliminates API-as-attack-surface risk.

**Experimenting with Coq/Rocq? → manual LSP integration.** There is no lean-copilot equivalent for Coq. The Coq LSP exposes proof state through the LSP protocol; build middleware to extract goals, format them for the LLM, receive tactics, and insert them into the buffer. Plan 3-6 weeks of engineering for a minimum viable Coq AI assistant. Coq's LLM training data is sparser — expect 10-15% lower completion rates than Lean. For organizations not committed to Coq, Lean 4 is the more AI-viable system.

**Specialized math models?** As of mid-2026, DeepSeek Prover V2 is the main open-weight model fine-tuned specifically for theorem proving (see the serious setup below); DeepSeek V3 and Llama 3.3 are general-purpose models trained on enough mathlib/arXiv data to perform adequately. Monitor the DeepSeek and Qwen families for further math-specific fine-tunes.
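A back-of-envelope check on the per-user KV-cache figure, assuming Llama 3.3 70B's published architecture (80 layers, 8 KV heads via GQA, head dimension 128) and an FP16 cache. Real provisioning varies with vLLM's block allocator and how much context users actually fill.

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
layers, kv_heads, head_dim = 80, 8, 128   # Llama 3.3 70B (GQA)
bytes_per_elem = 2                        # FP16 cache
ctx_tokens = 32_768                       # full 32K context window

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
total = per_token * ctx_tokens
print(f"{per_token / 1024:.0f} KiB/token, {total / 2**30:.1f} GiB at full context")
# -> 320 KiB/token, 10.0 GiB at full 32K context. Typical proof sessions sit
#    well below full context, which is where the 5-8 GB/user figure comes from.
```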

Setup walkthrough

  1. Install Lean 4: follow the installation guide at lean-lang.org (VS Code extension + `elan` toolchain manager). Takes ~10 minutes.
  2. Clone mathlib4, Lean's mathematical library: `git clone https://github.com/leanprover-community/mathlib4`.
  3. For AI-assisted proving: install Lean Copilot (VS Code extension) — it uses a local or remote LLM to suggest proof steps.
  4. Write a simple theorem: `theorem add_comm (a b : Nat) : a + b = b + a := by` — place the cursor after `by`; Lean Copilot suggests the induction + rewrite steps (a completed version follows this list).
  5. First AI-assisted proof in <30 minutes of setup — you need basic Lean syntax knowledge first (1-2 hours of learning).
  6. For stronger proving models: DeepSeek Prover V2 can be run locally via Ollama/vLLM and called from Lean via the Lean REPL + LLM bridge.
  7. Alternative: Coq + CoqPilot (VS Code extension) for Coq-based formal verification.
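A completed version of the step-4 example, as one proof a model might suggest; this form checks against core Lean 4 with no mathlib import (with `import Mathlib` the name collides with the library's own `add_comm`, so rename it there).

```lean
theorem add_comm (a b : Nat) : a + b = b + a := by
  induction b with
  | zero => simp                                        -- a + 0 = 0 + a
  | succ n ih => rw [Nat.add_succ, ih, Nat.succ_add]    -- push succ through, apply IH
```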

The cheap setup

Theorem proving is CPU-bound and RAM-light. Lean 4 + mathlib4 runs on any $300 laptop (Ryzen 5/Intel i5 + 16 GB RAM); individual proofs check in milliseconds. For AI-assisted proving on a budget: use a cloud API (DeepSeek API, ~$0.50 per 1M tokens) for proof suggestions, or run a distilled reasoning model (DeepSeek R1 Distill 7B) on a used GTX 1060 6 GB (~$60) — with the caveat noted above that small models produce many unusable suggestions. The LLM is a suggestion engine — the proof checker (Lean kernel) is the authority, and it's computationally trivial. $300 plus a free cloud API tier is genuinely viable.

The serious setup

A used [RTX 3090](/hardware/rtx-3090) 24 GB (~$700-900) runs DeepSeek Prover V2 locally — the strongest open-weight theorem-proving model. It generates full Lean proofs for undergraduate-to-graduate-level mathematics. Pair it with a Ryzen 7 7700X + 64 GB DDR5 + 1TB NVMe. Total: ~$1,800-2,200. For research-grade proving (IMO-level problems), the field is still dominated by closed-source frontier models, but DeepSeek Prover V2 + Lean Copilot on an RTX 3090 handles most undergraduate pure-math problems. Formal verification (checking proofs, as opposed to discovering them) runs on CPU alone.

Common beginner mistake

The mistake: expecting an LLM to "auto-prove" a theorem without learning Lean or Coq syntax first. Why it fails: LLMs generate proof text, but you need to understand the proof assistant's error messages to iterate. The model says `rw [add_comm]` — if Lean rejects it, you can't fix it without knowing what `rw` does. Theorem proving with AI is a collaboration, not automation. The fix: spend 2-4 hours learning basic Lean syntax (the Natural Number Game is the canonical intro — lean-lang.org/nng). Learn what `intro`, `apply`, `rw`, `induction`, and `cases` do. Then the LLM becomes a powerful autocomplete for proofs rather than a black box you can't debug. The LLM's job is suggesting steps, not guaranteeing correctness.
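For orientation, a minimal sketch of those tactics on toy goals; these check against core Lean 4 (`induction` appears in the completed walkthrough example above).

```lean
-- intro / apply: move the premise into context, then reduce the goal to
-- the implication's hypothesis.
example (p q : Prop) (h : p → q) : p → q := by
  intro hp    -- hp : p; goal is now q
  apply h     -- goal is now p
  exact hp

-- rw: rewrite the goal with an equation, left to right.
example (a b : Nat) (h : a = b) : a + 1 = b + 1 := by
  rw [h]

-- cases: split on the shape of n.
example (n : Nat) : n = 0 ∨ 0 < n := by
  cases n with
  | zero   => left; rfl
  | succ k => right; exact Nat.succ_pos k
```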


Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)

