Local AI for lawyers

Local AI for lawyers without the privilege risk: client-document RAG on hardware you control, deposition transcription, and the ABA-aligned ethics caveats every legal practice needs.

By Fredoline Eruo · Last reviewed 2026-05-08 · ~2,200 words

Answer first

For lawyers and paralegals, local AI is not a cost-saving measure — it is a compliance requirement. The core legal workflows that benefit from AI assistance — document review and summarization, deposition transcription, legal research RAG over case files — all involve material that is covered by attorney-client privilege, work-product doctrine, or client-confidentiality obligations. Pasting that material into a cloud AI service, regardless of the vendor's data-use policy, is a disclosure to a third party. Local inference on hardware you control eliminates that disclosure event entirely, keeping privilege intact and giving you an auditable property: no client data ever left the machine.

The hardware that makes this work: a Mac mini M4 Pro with 48 GB unified memory (silent, always-on, fits on a law-office bookshelf) or a desktop with a 24 GB GPU running Ollama and Llama 3.3 70B (Q2 entirely in VRAM; Q4 needs partial offload or a 48 GB card) for document RAG and summarization. This page covers the full local legal stack — from the ethical boundaries to the specific hardware and the retention policies that make it defensible if challenged.

Why local — privacy is the entire reason

Three reasons, all of which trace back to the same operational requirement: client data must not leave the firm's control.

Attorney-client privilege. Privilege protects communications between attorney and client from disclosure. Uploading privileged material to a cloud AI service — even one that claims not to train on enterprise data — is a disclosure to a third party that can, in some jurisdictions, waive the privilege. The cloud vendor's data policy is a contract between you and them; it does nothing to preserve privilege against an adversary who argues the disclosure itself was the waiver. Local inference avoids this risk entirely: the data is processed on hardware under the firm's control, with no third-party access. This is not a theoretical concern — bar associations in multiple states have issued opinions warning about cloud AI and privilege, and the ABA's guidance on generative AI explicitly flags this risk.

Work-product doctrine. Documents prepared in anticipation of litigation receive qualified protection from discovery. Pasting draft briefs, litigation strategy memos, or expert-work-product summaries into a cloud AI tool creates a record on a third-party server — a record that could, in some circumstances, be discoverable. Local processing eliminates the third-party server from the threat model.

Regulatory and client-consent obligations. Many clients, particularly corporate clients with sophisticated legal departments, now include AI-use provisions in engagement letters that either prohibit cloud AI outright or require specific disclosure and consent. A local stack lets you answer “yes, we use AI for document review — it runs on hardware in our office and no data leaves our network” without the qualification and risk-assessment dance that accompanies any cloud-tool disclosure.

What local AI can realistically do in a legal practice

Honest capabilities, focused on the workflows that local models handle well today.

Document RAG over case files. AnythingLLM with a local vector store (pgvector or Qdrant) ingests the case-file PDFs — complaints, answers, discovery responses, prior motions, key exhibits — and lets you query “what did the plaintiff state about damages in the second amended complaint?” with citations to specific pages. A 70B model at Q4 handles 50-200 page corpora comfortably; for 500+ pages, chunking strategy and retrieval quality become the bottleneck more than the model. Because the model is constrained to your documents rather than searching the internet, the risk of invented case citations drops sharply, though it never reaches zero: verify every citation against the source.
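The retrieval loop underneath is simple enough to sketch. Here is a minimal retrieve-then-generate pass against a local Ollama server, assuming nomic-embed-text and the 70B model are already pulled; the chunk texts, question, and model tags are illustrative, and a real pipeline would persist vectors in pgvector or Qdrant rather than a Python list:

```python
# Minimal retrieve-then-generate loop against a local Ollama server.
# Assumes `ollama pull nomic-embed-text` and the 70B model have already run.
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # Ollama's embeddings endpoint returns one vector per call.
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

# Stand-ins for chunks produced by PDF extraction.
chunks = [
    "Second Amended Complaint ¶ 42: Plaintiff seeks $1.2M in lost profits...",
    "Answer ¶ 42: Defendant denies the allegations of paragraph 42...",
]
index = [(c, embed(c)) for c in chunks]

question = "What did the plaintiff state about damages in the second amended complaint?"
qvec = embed(question)
best_chunk = max(index, key=lambda pair: cosine(qvec, pair[1]))[0]

r = requests.post(f"{OLLAMA}/api/chat", json={
    "model": "llama3.3:70b-instruct-q4_K_M",
    "stream": False,
    "messages": [
        {"role": "system", "content": "Answer only from the provided excerpt and cite it."},
        {"role": "user", "content": f"Excerpt:\n{best_chunk}\n\nQuestion: {question}"},
    ],
})
print(r.json()["message"]["content"])
```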

Deposition transcription. Whisper large-v3 running locally transcribes a deposition recording in 3-8 minutes per hour of audio on a GPU, with speaker-labeled output if paired with a diarization model. The transcript and the audio file never leave your hardware — a meaningful difference from cloud transcription services that retain audio for 30+ days by default. Deposition transcripts are frequently privileged or covered by protective orders; keeping them local is the conservative path.
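A local transcription pass is a few lines with the open-source openai-whisper package (one tooling choice among several; faster-whisper is a common drop-in alternative). The audio file name is illustrative:

```python
# Local transcription sketch with openai-whisper
# (pip install openai-whisper; needs ffmpeg on the PATH).
import whisper

model = whisper.load_model("large-v3")            # downloads weights on first run
result = model.transcribe("deposition_2026-03-12.mp3")

# Draft transcript with rough timestamps. Speaker labels require a separate
# diarization model (e.g. pyannote) and are not produced by Whisper itself.
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s-{seg['end']:7.1f}s] {seg['text'].strip()}")
```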

First-pass summarization of filings and correspondence. A 70B model summarizes a 30-page motion, a set of interrogatory responses, or a long email chain into a one-page brief with key points and action items. The output is a draft you read and correct — the model did the structural work of identifying sections and extracting the core arguments; you did the legal judgment of whether the summary is accurate and complete.
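The summarization call itself is one request to the local runtime. A hedged sketch against Ollama's chat API; the file path, model tag, and prompt wording are illustrative:

```python
# One-shot summarization against Ollama's chat endpoint.
import requests

motion_text = open("motion_to_dismiss.txt").read()   # text extracted from the PDF

resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.3:70b-instruct-q4_K_M",
    "stream": False,
    "messages": [{
        "role": "user",
        "content": "Summarize this motion in one page: key arguments, relief "
                   "requested, and action items for the responding party.\n\n" + motion_text,
    }],
})
# The output is a draft for attorney review, never finished work product.
print(resp.json()["message"]["content"])
```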

What it cannot do

Legal research is not the same as document RAG. A local model with RAG over your case files searches your documents. It does not search Westlaw, LexisNexis, or PACER. It does not Shepardize citations. It does not know whether a case was overturned last week. For legal research that requires current, comprehensive case-law search, the paid legal-research platforms are not replaceable by a local LLM. The model is a document assistant, not a legal-research service.

The output is not legal advice — and you are ethically obligated to know that. A local LLM generates text that reads like legal analysis. It is not legal analysis. It was not trained on your jurisdiction's specific case law. It does not know the judge. It cannot exercise professional judgment. Every output must be reviewed, verified, and owned by a licensed attorney. Presenting AI-generated output as legal work product without substantive attorney review is a professional-conduct risk in every jurisdiction that has opined on the question. This is part of our editorial policy and is the single most important paragraph on this page.

Local AI does not make your practice compliant by itself. Running AI locally eliminates the third-party-disclosure vector. It does not eliminate the need for competence (Model Rule 1.1), communication with the client about AI use (Model Rule 1.4), or supervision of non-lawyer assistants using the tool (Model Rule 5.3). The technology is one piece of a compliance program; the compliance program is your professional obligation.

Best models for legal work

  • Llama 3.3 70B Instruct — the primary model for document RAG, summarization, and drafting. At Q4_K_M it requires ~40-44 GB VRAM; at Q2 it fits in 24 GB with a quality trade-off on long-form legal analysis. The instruction-following quality matters more for legal work than for general chat — structured summarization of complex filings requires the model to hold document structure, party names, and procedural posture simultaneously. The 70B class handles this; the 14B class struggles on multi-party, multi-issue documents.
  • Whisper large-v3 — deposition and client-meeting transcription. 95-97% English accuracy on clean speech; drops to ~90% on overlapping or heavily accented speech. The output is a draft transcript, not a certified one — treat it accordingly.
  • nomic-embed-text or bge-large-en-v1.5 — embedding models for the RAG pipeline. These produce the vector representations that power the “find relevant chunks from the case file” step before the LLM generates the answer. Run alongside the LLM with minimal additional VRAM overhead.

Best tools for local legal AI

  • Ollama — the runtime. Pull llama3.3:70b-instruct, expose the OpenAI-compatible API, and connect any chat frontend. Handles GPU offloading, context management, and model loading. The simplest path from zero to working local LLM.
  • pgvector — PostgreSQL extension for vector similarity search. The production-grade choice for law-firm document stores that need to persist across matters and support concurrent access. More setup than Chroma or Qdrant, but it integrates with existing law-firm database infrastructure; a minimal schema sketch follows this list.
  • Text Embeddings Inference (TEI) — Hugging Face's embedding server. Exposes an OpenAI-compatible embeddings API, runs on GPU, handles batch embedding of large document corpora efficiently. Pair with pgvector or Qdrant for the retrieval backend.
  • AnythingLLM — the document-RAG frontend. One workspace per matter, ingest PDFs and text files, query with citations. Delete the workspace when the matter closes. The simplest path to “chat with my case file” without building a custom RAG pipeline.
  • Open WebUI — browser-based chat frontend for general drafting and summarization outside the RAG workflow. Multi-conversation support, per-conversation memory, connects to Ollama's API.
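For firms moving past AnythingLLM's built-in store, a per-matter pgvector schema is straightforward. A minimal sketch assuming PostgreSQL with the vector extension installed and the psycopg (v3) driver; the table name, matter slug, and 768 dimension (nomic-embed-text's output size) are illustrative:

```python
# Per-matter pgvector store sketch.
import psycopg

def vec_literal(v: list[float]) -> str:
    # pgvector's text input format: '[0.1,0.2,...]'. The pgvector-python
    # package offers a proper adapter; this keeps the sketch dependency-free.
    return "[" + ",".join(f"{x:g}" for x in v) + "]"

with psycopg.connect("dbname=firm_rag") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id        bigserial PRIMARY KEY,
            matter    text NOT NULL,
            source    text NOT NULL,   -- e.g. 'complaint.pdf p.12'
            body      text NOT NULL,
            embedding vector(768) NOT NULL
        )""")
    fake = [0.01] * 768                # stand-in for a real embedding
    conn.execute(
        "INSERT INTO chunks (matter, source, body, embedding) VALUES (%s, %s, %s, %s)",
        ("smith-v-jones", "complaint.pdf p.12", "Plaintiff seeks...", vec_literal(fake)),
    )
    # Five nearest chunks for this matter by cosine distance (the <=> operator).
    rows = conn.execute(
        "SELECT source, body FROM chunks WHERE matter = %s "
        "ORDER BY embedding <=> %s::vector LIMIT 5",
        ("smith-v-jones", vec_literal(fake)),
    ).fetchall()
```

Scoping every query by matter, as above, is also what makes the per-matter deletion policy enforceable at the database level.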

Best hardware — silent, secure tiers for legal offices

  • Budget — ~$600-1,000. Used RTX 3090 (24 GB) in an existing or secondhand office desktop. Runs Llama 3.3 70B at Q2 with full context in VRAM; at Q4 it requires partial offloading to system RAM, which slows inference to 5-10 tok/s — acceptable for batch summarization, not for interactive Q&A. The fan noise under load is noticeable in a quiet office; plan for a closet or server room.
  • Silent serious — ~$2,500. Mac mini M4 Pro with 48 GB unified memory. Silent, fits on a bookshelf, and draws only a few watts at idle. Runs Llama 3.3 70B at Q4 in the unified memory pool at single-digit tokens per second — slower than a 4090 but silent and always-on. The right choice for a solo practitioner or small firm where the machine sits in the same room as the attorney. Bonus: Apple Silicon's unified memory means you can run the LLM, the embedding model, and the vector database simultaneously without VRAM fragmentation.
  • Workstation — ~$3,500+. RTX 4090 (24 GB) or RTX 6000 Ada (48 GB) in a sound-dampened case. The 48 GB card runs 70B at Q4 with full context and no offloading, delivering roughly 15-25 tok/s for interactive Q&A over large case corpora. The production tier for firms handling multiple concurrent matters with document sets in the thousands of pages.

The hardware floor for legal work is 24 GB VRAM or 48 GB unified memory — below that, the 70B model that does competent legal summarization won't fit at usable quantization. The privacy threat model that makes local worth the hardware spend is in /guides/local-ai-for-privacy.
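The floor comes straight from the arithmetic. A back-of-envelope check, using approximate effective bit-rates for the GGUF quantization formats (estimates, not measurements):

```python
# Back-of-envelope check on the 24 GB floor.
params = 70e9
for name, bits in [("Q4_K_M", 4.8), ("Q2_K", 3.0)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB for weights alone, before KV cache")
# Q4_K_M lands around 42 GB (hence 48 GB cards or unified memory);
# Q2_K lands in the mid-20s, the compromise tier for 24 GB GPUs.
```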

Workflows — concrete day-to-day walkthroughs

1. Case-file RAG and first-pass review. Ingest the complaint, answer, key exhibits, and prior motions into AnythingLLM as a workspace named for the matter. Query: “summarize the plaintiff's claimed damages with citations to the complaint paragraphs.” The model returns a structured summary with paragraph references. You verify each citation against the source document. The model did the structural work of extracting and organizing; you did the legal verification. This workflow turns a 45-minute manual document review into a 10-minute verification pass.

2. Deposition transcript analysis. Transcribe the deposition audio with Whisper large-v3 locally. Feed the full transcript into the LLM with a structured prompt: “Identify every statement the witness made about the July 14 meeting. Quote the relevant transcript passages with timestamps. Flag any statements that contradict the witness's earlier interrogatory responses.” The model produces a timestamped analysis in 30-60 seconds. You read the flagged passages and compare against the interrogatories. The model identified patterns across a 200-page transcript; you made the legal determination of whether a contradiction is material.
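Chaining the two steps looks like this in practice. A sketch that feeds the local Whisper transcript and the interrogatory text into the local model; file names, model tag, and prompt wording are illustrative, and a 200-page transcript may exceed the context window and need to be fed in chunks:

```python
# Pipe the Whisper transcript into the local LLM for contradiction-flagging.
import requests

transcript = open("deposition_transcript.txt").read()
interrogs = open("interrogatory_responses.txt").read()

prompt = (
    "Identify every statement the witness made about the July 14 meeting. "
    "Quote the relevant transcript passages with timestamps. Flag any "
    "statements that contradict the interrogatory responses below.\n\n"
    f"TRANSCRIPT:\n{transcript}\n\nINTERROGATORY RESPONSES:\n{interrogs}"
)
resp = requests.post("http://localhost:11434/api/chat", json={
    "model": "llama3.3:70b-instruct-q4_K_M",
    "stream": False,
    "messages": [{"role": "user", "content": prompt}],
})
print(resp.json()["message"]["content"])   # flags for attorney review only
```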

3. Correspondence drafting and summarization. Paste the opposing counsel's 10-page letter into the LLM with: “Summarize in one page: (1) the relief demanded, (2) the legal basis cited, (3) the factual allegations, (4) the deadline. Draft a two-paragraph acknowledgment letter confirming receipt and stating we will respond within the timeframe.” The model produces a draft in 20 seconds. You review for accuracy, adjust the tone, and send. The model handled the reading and structural extraction; you handled the professional judgment and client communication.

Beginner setup — $600-1,000 entry path

The minimum viable local-AI rig for a solo practitioner or small firm testing the stack before committing to a dedicated machine.

  1. Hardware. Used RTX 3090 ($600-750) in an existing office desktop with a 750W+ PSU. Total spend under $1,000 if the desktop exists; add $300-500 for a used office PC if not.
  2. Install Ollama. One command on Linux, one-click installer on Windows or macOS. Pull llama3.3:70b-instruct-q2_K (fits in 24 GB) for testing; upgrade to Q4 when you add a second GPU or move to a 48 GB card. A smoke-test sketch follows this list.
  3. Install AnythingLLM. Desktop app, one-click install. Create a workspace, upload a non-privileged test document set, and run a few queries to validate the retrieval quality.
  4. Configure the retention policy. Delete AnythingLLM workspaces when the matter closes, and make that deletion a step on the matter-closing checklist. Enable full-disk encryption (BitLocker on Windows, FileVault on macOS, LUKS on Linux). Write a one-page internal policy documenting where AI is used, which models, which hardware, and the review procedure — this is the document you show the bar if asked.
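Before any privileged material touches the machine, verify the stack end to end. A minimal smoke test against the local Ollama API, using no client data:

```python
# Post-install smoke test: confirms Ollama is serving and the pulled model generates.
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.3:70b-instruct-q2_K",
    "prompt": "Reply with the single word: ready",
    "stream": False,
})
r.raise_for_status()
print(r.json()["response"])   # expect something close to "ready"
```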

The full privacy-first path with operational and policy guidance is at /paths/privacy-first. The cost math is at /guides/does-running-ai-locally-save-money.

Serious setup — $2,500+ path

The production rig for a firm that has validated the workflow and wants silent, always-on, full-quality inference.

  1. Hardware. Mac mini M4 Pro with 48 GB unified memory ($2,200-2,500). Silent, energy-efficient, integrates into an office without a server closet. Runs Llama 3.3 70B at Q4, the embedding model, and the vector database concurrently in unified memory.
  2. Ollama with llama3.3:70b-instruct at Q4_K_M. Full 32K context. The quality difference from Q2 to Q4 on legal summarization is real — fewer dropped party names, more accurate procedural-posture descriptions, cleaner structured output.
  3. pgvector as the vector store. More setup than AnythingLLM's built-in store but supports multi-user access, persistent storage across matters, and proper backup. Run it on the same Mac or on a separate office server.
  4. TEI for embedding generation. Batch-embed the firm's document corpus once; run incremental updates for new matters. TEI's throughput on the M4 Pro's GPU is 500-1,000 embeddings per second depending on chunk size; a batch-embedding sketch follows this list.
  5. Retention and audit logging. Enable conversation logging with per-matter tagging. Set auto-delete rules: 30 days for non-matter chat, matter-close for workspaces. Document the configuration.
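Batch embedding through TEI is one HTTP call per batch. A sketch assuming a locally running TEI server — the port follows the docs' Docker example (-p 8080:80), so adjust it to however TEI was launched; chunk texts are illustrative, and the returned vectors pair with the pgvector store from step 3:

```python
# Batch-embedding sketch against a local TEI server.
import requests

chunks = ["Discovery response 14: ...", "Exhibit C, page 3: ..."]
r = requests.post("http://localhost:8080/embed",
                  json={"inputs": chunks, "truncate": True})
r.raise_for_status()
vectors = r.json()            # one embedding per input chunk, in order
assert len(vectors) == len(chunks)
```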

Common mistakes lawyers make with local AI

  • Assuming “the vendor doesn't train on my data” satisfies the NDA or privilege obligation. It does not. A cloud vendor is a third party regardless of their data policy. Disclosure to a third party is disclosure to a third party. The vendor's terms are a contract between you and them; they do not bind your adversary in a privilege dispute. Local inference is the only path that eliminates the third-party vector entirely. See /guides/local-ai-for-privacy for the full threat model.
  • Treating the model's output as legal work product without review. The model generates text that reads like a competent legal summary. It is not legal analysis. Every output must be reviewed by a licensed attorney who takes professional responsibility for it. The model does not know your jurisdiction, your judge, your case strategy, or the ethical rules that govern your practice. You do.
  • Not documenting the AI stack for the client file. If a client or a court asks how AI was used in a matter, “I ran it on my computer” is not a sufficient answer. Document: which model, which runtime, which hardware, which documents were ingested, which outputs were produced, and who reviewed each output. This is the same documentation standard you would apply to any expert-assistant or paralegal work product.
  • Underinvesting in the RAG pipeline and blaming the model. A 70B model with a poorly configured RAG pipeline — bad chunk sizes, no overlap between chunks, no reranking, no citation mapping — produces summaries that miss key documents and pull in irrelevant ones. The failure is in the retrieval, not the generation. Invest time in chunking strategy and retrieval quality before concluding the model isn't fit for purpose; a starter chunker sketch follows this list. The RAG glossary is at /glossary/rag.
  • Running the rig without full-disk encryption. A local AI machine that processes privileged material is a target. If the machine is lost, stolen, or accessed by an unauthorized party, unencrypted storage means the entire case file is exposed. Full-disk encryption is the non-negotiable baseline. If you cannot state with confidence that the drive is encrypted, do not put client data on the machine.
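To make the "invest in retrieval" advice concrete, here is a serviceable starting point for the chunking piece. The window size and overlap values are tuning knobs to test against your own retrieval quality, not recommendations from any benchmark:

```python
# Naive-but-serviceable chunker: fixed windows with overlap so clauses are
# not cut off at chunk edges.
def chunk(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    step = size - overlap
    pieces = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece.strip():
            pieces.append(piece)
    return pieces
```

For legal documents, chunking on structural boundaries (numbered paragraphs, section headings, exhibit markers) usually retrieves better than raw character windows; treat the fixed-window version as the baseline to beat.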

Next recommended step

What local AI actually protects against — and what it doesn't — in operator detail.