OCR / Document Text Extraction
Extracting text from images, PDFs, screenshots, and handwritten documents. Modern multimodal LLMs (Qwen2.5-VL, InternVL, GPT-4V) increasingly outperform specialized OCR engines on complex layouts.
Setup walkthrough
- Install: `pip install surya-ocr` (VikParuchuri's Surya — SOTA open-weight OCR for documents). First run auto-downloads the detection + recognition models (~1 GB total); no manual setup.
- CLI: `surya_ocr image.jpg` outputs JSON with bounding boxes + text for every detected line.
- For PDFs: `surya_ocr document.pdf --output_dir out/` processes page-by-page and outputs per-page JSON + Markdown. Expect the first result in 10-30 seconds on CPU for a single page; faster on GPU.
- Alternative for simple cases: `pip install pytesseract` (wraps Tesseract) — faster but worse on complex layouts.
- For multimodal LLM OCR: `ollama run minicpm-v`, then upload an image and ask "Transcribe all text in this image." A scripted version of this call is sketched below.
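For scripting the Ollama route, a minimal sketch using the `ollama` Python client (assumes `pip install ollama` and a pulled `minicpm-v` model; the image path is a placeholder):

```python
# Minimal sketch: OCR via a local VL model through the Ollama Python client.
# Assumes `ollama pull minicpm-v` has already been run; "scan.png" is a
# placeholder path.
import ollama

response = ollama.chat(
    model="minicpm-v",
    messages=[{
        "role": "user",
        "content": "Transcribe all text in this image. Preserve reading order.",
        "images": ["scan.png"],  # local file path; the client encodes it
    }],
)
print(response["message"]["content"])
```

VL models are slower per page than Surya, but they handle the tables, forms, and handwriting that line-level engines miss.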
The cheap setup
Surya OCR runs on CPU at 10-30 seconds per page on a modern laptop (Ryzen 5/Intel i5). No GPU required. Any $300-400 laptop handles batch OCR of documents overnight. For faster throughput: a used GTX 1060 6 GB ($60) drops per-page time to 2-5 seconds. For multimodal LLM OCR (Qwen2-VL, MiniCPM-V), a used GTX 1660 Super 6 GB (~$100) handles 7B VL models at 5-10 seconds per image — good enough for complex layouts like tables and forms.
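Rough throughput math from those numbers: at 20 s/page on CPU, an 8-hour overnight run covers about 8 × 3,600 / 20 ≈ 1,440 pages; at 3 s/page on the GTX 1060, the same window covers roughly 9,600 pages.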
The serious setup
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb). Runs Surya OCR at 1-3 seconds per page on GPU. Can run Qwen2-VL 7B at 3-5 seconds per image for complex document understanding (combined OCR + layout + table extraction). For production document pipelines processing 1000s of pages/day, pair with Ryzen 7 7700X + 32 GB DDR5 + 2TB NVMe. Total: ~$900-1,100. OCR is VRAM-light — 6 GB is sufficient for most models.
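For the batch side of such a pipeline, a hedged sketch that shells out to the surya_ocr CLI from the walkthrough above (directory names and error handling are placeholders):

```python
# Batch OCR sketch: run surya_ocr once per PDF and collect outputs.
# Assumes the CLI and --output_dir flag shown in the setup walkthrough;
# "inbox" and "ocr_out" are placeholder paths.
import subprocess
from pathlib import Path

INBOX = Path("inbox")     # drop PDFs here
OUTDIR = Path("ocr_out")  # per-document JSON + Markdown lands here
OUTDIR.mkdir(exist_ok=True)

for pdf in sorted(INBOX.glob("*.pdf")):
    result = subprocess.run(
        ["surya_ocr", str(pdf), "--output_dir", str(OUTDIR / pdf.stem)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        # Log and continue; one bad scan shouldn't kill the overnight batch.
        print(f"FAILED {pdf.name}: {result.stderr.strip()[:200]}")
```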
Common beginner mistake
The mistake: Running Tesseract with default settings on a scanned document with a complex layout (multi-column, tables, headers) and getting garbled output.
Why it fails: Tesseract is a line-level OCR engine — it doesn't understand document layout. On multi-column PDFs it reads straight across columns, mixing unrelated text.
The fix: Use Surya OCR or a multimodal VL model (Qwen2-VL, MiniCPM-V) for complex documents; these detect columns, tables, headers, and reading order before extracting text. For simple single-column documents Tesseract is fine (a minimal sketch for that case follows); for anything with structure, use a layout-aware model.
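For that simple single-column case, a minimal pytesseract sketch (assumes the Tesseract binary is installed separately via your package manager; the image path is a placeholder):

```python
# Simple-case OCR: single-column documents only.
# pytesseract wraps the system Tesseract binary, which must be installed
# separately (e.g. apt install tesseract-ocr); "letter.png" is a placeholder.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(
    Image.open("letter.png"),
    config="--psm 6",  # page segmentation mode 6: one uniform block of text
)
print(text)
```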
Recommended setup for OCR / document text extraction
Browse all tools for runtimes that fit this workload.
Reality check
Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.
Common mistakes
- Buying for spec-sheet VRAM without modeling KV cache + activation overhead (see the sizing sketch after this list)
- Underestimating quantization quality loss below Q4
- Skipping flash-attention support (real perf gap on long context)
- Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)
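To make the first mistake concrete, a back-of-envelope KV-cache sizing sketch (the architecture numbers are illustrative, roughly a Qwen2-7B-class decoder with grouped-query attention; check your model's config.json):

```python
# Back-of-envelope KV-cache sizing. Illustrative architecture numbers
# (verify against your model's config.json before buying hardware).
layers, kv_heads, head_dim = 28, 4, 128   # Qwen2-7B-class decoder with GQA
bytes_per_elem = 2                        # fp16 cache
ctx = 8192                                # context length in tokens

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
total_gib = per_token * ctx / 1024**3
print(f"{per_token / 1024:.0f} KiB/token -> {total_gib:.2f} GiB at {ctx} tokens")
# ~56 KiB/token -> ~0.44 GiB at 8k context, on top of the model weights.
```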
What breaks first
The errors most operators hit when running OCR / document text extraction locally. Each links to a diagnose+fix walkthrough.
Before you buy
Verify your specific hardware can handle OCR / document text extraction before committing money.
OCR and document-understanding workloads run vision-language models, so the buyer math differs from text-only LLM shopping: the vision encoder adds prefill compute, and every page image consumes context tokens on top of the prompt.
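Rough per-image context math (assuming Qwen2-VL's ~28×28-pixel effective patches after 2×2 token merging; check the model card for your VL model): a 1024×1280 page encodes to about ⌈1024/28⌉ × ⌈1280/28⌉ = 37 × 46 ≈ 1,700 visual tokens before any prompt text, so multi-page documents fill context quickly.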