Local AI for privacy — what you do and don't get
An honest threat model for running AI offline. Per-runtime telemetry checklists for Ollama, LM Studio, vLLM, and llama.cpp. What model weights, tokenizer caches, RAG stores, and OS-level metadata leak even on an air-gapped box. The gap between 'local inference' and 'true privacy', and how to close it.
Answer first
Running an open-weights model on hardware you control eliminates the single biggest privacy risk of cloud AI: a third party reading, logging, training on, or being subpoenaed for your prompts and outputs. That is real, and it is the reason most operators reading this page came to local inference. What local inference does not do is silently turn your machine into a private oracle. The runtime still talks to the network sometimes, the model weights still sit in a directory anyone with shell access can read, the RAG store you build over your own documents is a far higher-value target than the model itself, and the operating system underneath is busy snapshotting clipboard contents, paging memory to disk, and indexing your home directory the entire time.
The honest framing: local inference is a privacy primitive, not a privacy product. It removes one specific category of leak. Whether the rest of your system also stops leaking depends on choices you make below the runtime — which is what this guide is about. The opinionated learning path that walks the same ground end-to-end is at /paths/privacy-first; the broader local-vs-cloud framing is in /guides/why-run-ai-locally-instead-of-chatgpt and /compare/local-vs-cloud.
What “local” actually buys you
Five concrete properties, none of which require trust in a vendor.
- No prompt egress to a third party. Your conversations are not crossing a network boundary on the inference path. A cloud provider cannot log, retain, train on, or be compelled to disclose what they never received.
- No output egress to a third party. Same property in the other direction. The model's response stays on the same machine that sent the prompt.
- No conditional access. The provider can't change the safety policy, raise the price, deprecate the model, or rate-limit you. The weights you have are the weights you have.
- Reproducibility. Same prompt, same seed, same weights, same machine produces the same output. This is a property cloud APIs do not give you and one auditors increasingly want; a quick spot-check is sketched at the end of this section.
- Air-gap-able. A correctly configured local stack runs with the network unplugged. Cloud cannot make this claim, even with VPC isolation.
These are real and they are why the privacy-first path exists. None of them constitute “privacy” on their own.
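The reproducibility property is easy to spot-check. A minimal sketch, assuming an Ollama daemon on its default 127.0.0.1:11434 and a model already pulled locally (the model tag below is only an example):

```python
# Reproducibility check: same prompt, same seed, same weights should give
# the same output. Assumes an Ollama daemon on 127.0.0.1:11434 and a model
# already pulled locally (the "llama3.1:8b" tag is just an example).
import requests

def generate(prompt: str, seed: int) -> str:
    resp = requests.post(
        "http://127.0.0.1:11434/api/generate",
        json={
            "model": "llama3.1:8b",          # substitute any local model tag
            "prompt": prompt,
            "stream": False,
            "options": {"seed": seed, "temperature": 0},
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["response"]

a = generate("Summarize the GDPR in one sentence.", seed=42)
b = generate("Summarize the GDPR in one sentence.", seed=42)
print("deterministic:", a == b)
```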
What a local stack still leaks
Six categories operators routinely miss when they assume “local” means “private.”
1. Runtime telemetry and update checks. Some runtimes phone home for version checks, crash reports, or anonymous usage metrics. Even when these payloads contain no prompt content, they reveal that you are running the runtime, which version you are on, and in some cases which model file paths are in use. This is fingerprinting-grade metadata. The section below has the per-runtime breakdown.
2. Model and tokenizer downloads. Pulling a model from Hugging Face or Ollama's registry is an HTTPS request to a server that knows your IP address, the model name, and the timestamp. The download itself is encrypted; the fact that the download happened is not. A way to pin the Hugging Face tooling to its local cache is sketched after this list.
3. The RAG store. If you indexed your client work, your medical records, or your tax docs into Chroma or Qdrant, the resulting vector database is structurally a high-value target. It is on disk, in the clear by default, and trivially queryable by anything with read access. The model is a generic 8 GB file; the RAG store is your life.
4. OS-level captures. Clipboard managers, OS-level recents lists, recently-used-files menus, screenshot tools, indexer services (Spotlight, Windows Search), and accessibility logs all observe content as you work with it. The runtime does not know to hide from them.
5. Swap and hibernation. When physical memory is exhausted, the operating system writes pages — including pages containing prompt and response text — to a swap file or hibernation image on disk. On reboot or seizure, those files persist.
6. Network-side observers, even on localhost. A runtime that exposes an HTTP server on localhost:11434 is reachable by every process on the machine. A browser tab, a malicious dependency, or any sandboxed code with network permission can talk to it. Localhost is “not on the network” only from a specific viewing angle.
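For the download category above, the Hugging Face tooling honors two environment variables that pin it to the local cache. A minimal sketch, assuming huggingface_hub is installed and the snapshot is already cached (the repo ID is only an example):

```python
# Force Hugging Face tooling to serve models from the local cache only.
# HF_HUB_OFFLINE / TRANSFORMERS_OFFLINE are honored by huggingface_hub and
# transformers; set them before those libraries are imported.
import os

os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from huggingface_hub import snapshot_download

# Succeeds only if the snapshot is already in ~/.cache/huggingface/hub;
# otherwise it raises instead of reaching out to huggingface.co.
# The repo ID below is just an example.
path = snapshot_download("Qwen/Qwen2.5-7B-Instruct")
print("serving weights from:", path)
```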
Telemetry by runtime
What each major runtime sends out of the box, current as of May 2026. Verify on your own machine — runtimes update.
Ollama. The Ollama daemon checks for updates against ollama.com at startup and periodically thereafter. Model pulls hit registry.ollama.ai. There is no documented prompt or output telemetry, but the docs do not spell out exactly what the update check carries, and absence of evidence is not evidence of absence: verify with a network monitor rather than taking it on faith. To cut the egress, block ollama.com and registry.ollama.ai at the firewall once your models are pulled, or run the daemon in a network namespace with no egress. (OLLAMA_NOPRUNE=1 keeps already-pulled model blobs from being pruned at startup, which matters once you go offline, but it does not change what the daemon sends over the network.)
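A quick post-firewall sanity check, assuming the daemon is running on its default 127.0.0.1:11434 (the model tag is an example): local inference should still work, and pulls of new models should now fail.

```python
# Sanity check after blocking ollama.com / registry.ollama.ai at the firewall.
import requests

BASE = "http://127.0.0.1:11434"

# 1. Models already on disk are still served.
tags = requests.get(f"{BASE}/api/tags", timeout=5).json()
print("local models:", [m["name"] for m in tags.get("models", [])])

# 2. A pull of a new model should fail once egress is blocked.
try:
    pull = requests.post(
        f"{BASE}/api/pull",
        json={"name": "gemma2:2b", "stream": False},  # example tag
        timeout=30,  # may run to the timeout if the firewall drops packets
    )
    blocked = pull.status_code != 200 or "error" in pull.json()
except requests.RequestException:
    blocked = True

print("pull blocked as expected" if blocked else "WARNING: registry is still reachable")
```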
LM Studio. Closed-source GUI that ships with anonymous analytics enabled by default — turn off in Settings → Privacy. Update checks hit lmstudio.ai. Model browser pulls from Hugging Face. Once configured offline, the inference path itself does not appear to phone home, but the “closed-source” caveat above applies. Operators with strict requirements typically prefer Ollama or llama.cpp on grounds of source-availability alone.
vLLM. Open-source. Does not phone home in default configurations. Hugging Face downloads still hit HF when you reference a remote model. The biggest privacy footgun in vLLM is the OpenAI-compatible API server bound to 0.0.0.0:8000 by default in some setups — bind to 127.0.0.1 explicitly unless you specifically want LAN access, and when you do, put it behind Open WebUI with auth or behind a reverse proxy.
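Where LAN access is not needed at all, vLLM's in-process engine sidesteps the listening-port question entirely: there is no server for another local process to talk to. A minimal sketch, assuming supported hardware and a model already in the local Hugging Face cache (the model ID is an example):

```python
# In-process vLLM inference: no HTTP server, no open port.
import os

os.environ["HF_HUB_OFFLINE"] = "1"   # never reach out to huggingface.co

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")   # resolved from the local cache
params = SamplingParams(temperature=0.2, max_tokens=256, seed=42)

outputs = llm.generate(["Draft a two-sentence NDA-safe status update."], params)
print(outputs[0].outputs[0].text)
```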
llama.cpp. Open-source, single binary, no network code on the inference path at all. Once the GGUF is on disk, the binary will run forever with the network unplugged. This is the lowest-trust runtime by construction and the right pick for an air-gapped operator.
Model weights, tokenizers, and the cache directories
Where the data sits matters. Default locations:
- Ollama: models in ~/.ollama/models on macOS/Linux, %USERPROFILE%\.ollama\models on Windows.
- Hugging Face cache: ~/.cache/huggingface/hub/ by default. Used by vLLM, transformers, and most Python-side tools.
- llama.cpp: wherever you put the GGUF. There is no default cache.
- LM Studio: ~/.cache/lm-studio/models/ on Linux, equivalent directories on macOS and Windows.
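A small audit script that reports what is sitting in those default locations, so the encrypted-volume decision below is made with real numbers. The paths mirror the defaults above; adjust them to your layout:

```python
# Quick audit of what is sitting in the clear in the default cache locations.
from pathlib import Path

CANDIDATES = [
    Path.home() / ".ollama" / "models",
    Path.home() / ".cache" / "huggingface" / "hub",
    Path.home() / ".cache" / "lm-studio" / "models",
]

for root in CANDIDATES:
    if not root.exists():
        continue
    files = [p for p in root.rglob("*") if p.is_file()]
    total_gb = sum(p.stat().st_size for p in files) / 1e9
    print(f"{root}: {len(files)} files, {total_gb:.1f} GB, "
          f"directory mode: {oct(root.stat().st_mode & 0o777)}")
```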
These directories sit in the clear on a normal disk. Anyone with read access on your account, anyone with root, anyone with physical access to the disk after a power-off, and any file-syncing service pointed at your home directory (Dropbox, iCloud Drive, OneDrive) can read or copy them. For sensitive operators: put the model cache on an encrypted volume that is unlocked only while the runtime is running. On macOS, FileVault covers the whole disk; on Linux, LUKS does the same; on Windows, BitLocker is the equivalent.
RAG stores are a bigger privacy surface than the model
Most operators worry about the model file. The model is a 4-8 GB blob of weights that any other operator already has. The interesting target is what you put into a vector store. RAG over your client work, your therapy notes, or your tax records means an embeddings database — typically Chroma, Qdrant, or SQLite-vec — sits on disk holding paragraph-by-paragraph chunks of your most sensitive documents alongside their embedding vectors.
Two specific risks to take seriously. First, the database files are unencrypted by default; they are normal files in normal directories with normal read permissions. Second, the chunks are typically stored as raw text alongside their embeddings — the embedding is not a cryptographic transform, it does not hide content, and even if it did, the raw text is right next to it. A laptop theft of a privacy-first user with an unencrypted RAG store is a data breach. Mitigations: full-disk encryption, store the vector DB on an encrypted volume, never sync the directory holding the DB to a cloud drive, and treat the DB file with the same handling discipline as the source documents that populated it. The operator-grade stack assembly is in the AnythingLLM profile and the privacy-first path.
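To make the second risk concrete, here is a minimal sketch against a persisted Chroma store; the path and collection name are hypothetical. Any process that can read the directory can do the same, with no key and no passphrase:

```python
# Demonstration that a persisted Chroma store holds your chunks as readable
# text. Store path and collection name below are hypothetical; substitute
# your own.
import chromadb

client = chromadb.PersistentClient(path="./rag_db")
collection = client.get_collection("notes")

# Pull the first few stored chunks; no decryption step is involved.
rows = collection.get(limit=5, include=["documents", "metadatas"])
for doc, meta in zip(rows["documents"], rows["metadatas"]):
    print(meta, "->", doc[:120])
```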
OS-level metadata — clipboard, swap, journal, indexer
The runtime can be perfectly disciplined and the operating system underneath will still observe what you do.
- Clipboard. macOS Universal Clipboard, Windows Clipboard History, and most third-party clipboard managers persist clipboard contents for hours or days. If you copy-paste a sensitive prompt, that text is in the clipboard log even if it never hit the model.
- Swap and hibernation files. Anything in RAM during inference can be paged out. On macOS, /private/var/vm/sleepimage contains a snapshot. On Linux, swap is encrypted only if you configure it that way.
- Indexers. macOS Spotlight, Windows Search, and Linux desktop indexers walk your home directory continuously. RAG store files, exported chat transcripts, and saved prompts all end up in the index, and the index itself is searchable by anyone with login access.
- Recents and shell history. Recently-opened-files lists in editors, jump lists in Windows, and shell history files (~/.bash_history, ~/.zsh_history) reveal what you worked with even after the files are deleted.
- Browser local storage. If your local-AI frontend is a browser tab pointed at a localhost runtime, the conversation persists in IndexedDB or localStorage tied to that origin. Browser sync extensions then copy that across devices.
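For the swap and hibernation item above, a Linux-only heuristic (an assumption; macOS and Windows need different checks) that reports whether swap is active and whether it at least sits on a dm-crypt mapper device. A sketch, not an audit:

```python
# Linux-only heuristic: is swap active, and does it at least sit on a
# dm-crypt/LVM mapper device? Not a substitute for a real audit.
from pathlib import Path

swaps = Path("/proc/swaps").read_text().splitlines()[1:]
if not swaps:
    print("no active swap: prompt text cannot be paged to disk")
for line in swaps:
    device = line.split()[0]
    mapper = device.startswith("/dev/mapper/") or device.startswith("/dev/dm-")
    print(f"{device}: {'probably encrypted (mapper device)' if mapper else 'check encryption manually'}")
```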
Logs and the retention question
Every runtime and every frontend writes some kind of log. Open WebUI persists conversation history in a SQLite database by default. AnythingLLM stores workspace conversations and the documents you uploaded. Ollama writes server logs that include prompt and response counts but not (by default) content. vLLM's server log can include prompt content depending on log level.
The privacy-first move is to set a retention policy and enforce it. In Open WebUI: configure auto-deletion of conversations after N days, or use the “temporary chat” mode that holds the session in memory only. In AnythingLLM: separate workspaces per project, delete the workspace when the engagement ends. At the OS level: a weekly cron that empties the runtime's log directory is a small piece of automation that closes a real leak.
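A minimal sketch of that cron-style sweep. The log directory and the seven-day window are examples; point it at whatever your runtime and frontend actually write:

```python
# Minimal retention sweep for a runtime's log directory, meant to run from
# cron or a systemd timer.
import time
from pathlib import Path

LOG_DIR = Path.home() / ".ollama" / "logs"   # example location
MAX_AGE_DAYS = 7

cutoff = time.time() - MAX_AGE_DAYS * 86400
if LOG_DIR.exists():
    for path in LOG_DIR.iterdir():
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            print("deleted", path)
```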
The threat model checklist
Before deciding whether a stack clears your privacy bar, name the threat you are defending against; the right stack changes substantially depending on the answer.
- “I don't want a cloud vendor reading my prompts.” A correctly configured local runtime solves this. Most other categories below do not require local; they require additional discipline.
- “I don't want a cloud vendor training on my data.” Same as above. Local solves it cleanly.
- “I don't want my employer to see my AI use.” Local helps if your work device isn't already running endpoint surveillance. If it is, local AI is irrelevant — the EDR agent sees everything you type into anything.
- “I don't want a partner / household member to see my conversations.” Local does nothing here unless the disk is encrypted and the chat history is auto-deleted. The default Open WebUI install is more discoverable than ChatGPT's history page.
- “I am subject to HIPAA / legal privilege / NDA constraints.” Local is the right answer, but only as part of a stack with full-disk encryption, no cloud sync of model or RAG directories, audit logging, and contract language that allows local AI specifically. The freelancer NDA-and-contracts guide covers the contract side.
- “I am air-gapping for nation-state-grade adversary protection.” Local is necessary, not sufficient. You need llama.cpp specifically (no network code), an air-gapped machine, FIPS-grade disk encryption, and a hardware-keyed signing chain for the weights. This page is not the right reference for that bar.
Where local AI is the right answer, and where it isn't
Local inference is the right privacy answer for working professionals handling NDA-bound client work, for therapists or doctors drafting case notes, for journalists protecting sources, for students working through coursework that includes original research, and for anyone whose threat model includes “a cloud subpoena could compel disclosure.” In each of those cases the cloud option is structurally worse than local even when local is operationally less convenient.
Local is the wrong answer when the threat is a determined attacker with access to the device. A laptop running Ollama with full-disk encryption is more secure than the same laptop with no encryption talking to ChatGPT — but neither is secure against an adversary who can compel a password. Local AI is also a poor fit when the workflow specifically requires frontier reasoning beyond what 70B-class open weights deliver; in those cases the right move is hybrid (local for the 80% of routine work, cloud frontier for the 20% that needs it) with a written policy about what gets sent where. The honest cost-and-tradeoff math for hybrid setups is in /compare/local-vs-cloud; the broader operator-grade tooling map is at /guides/best-free-local-ai-tools.
The summary, said plainly: privacy is a stack property, not a runtime property. Local inference removes the loudest leak. The rest is up to you.
Next recommended step
Hardware, runtime, RAG, and OS hardening as a single end-to-end setup.
Privacy is only as strong as the hardware that enforces it. A machine with soldered storage, a locked-down OS, and hardware-backed attestation closes vectors that a commodity PC leaves open. Apple Silicon machines offer that architecture by default, which is why they dominate conversations about truly private AI workloads. The difference between privacy on paper and privacy in practice often comes down to which machine is sitting on your desk.
The hardware that makes privacy enforceable, not aspirational: best Mac for local AI.