Beginner guide · Tooling

Free AI tools that run on your computer

The 8 tools that matter for running AI locally and cost zero dollars. Ollama, LM Studio, llama.cpp, GPT4All, Open WebUI, AnythingLLM, Continue.dev, Aider — what each one does, what it runs on, what breaks first, and which combinations work together.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,800 words

What “free” means here

Every tool on this page is open-source or has a free desktop tier with no usage cap, no telemetry forced on by default, and no monthly bill. They all run on consumer hardware. None of them require an OpenAI key, an Anthropic key, or any cloud account at all (though several can optionally talk to cloud APIs if you want hybrid setups). The only cost is the electricity to run your own GPU.

We're not listing every tool that exists — that would be a directory, not a guide. The full directory is at /tools. This page covers the eight that 90% of local-AI users actually settle on, the ones that solve clearly different problems, and how they fit together.

Ollama — the default runtime

Ollama is the closest thing local AI has to “the obvious starting point.” It is a small command-line program plus a model registry: you type ollama pull llama3.1:8b, the model downloads in the right quantization for your machine, and ollama run llama3.1:8b drops you into a chat session. It speaks an OpenAI-compatible HTTP API on localhost, which means most other tools (Open WebUI, Continue.dev, AnythingLLM) talk to it without configuration.
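
In practice the whole loop is a couple of commands. The model tag below is the same example used above, and the curl call assumes Ollama's default port of 11434; anything that speaks the OpenAI API format can hit the same endpoint.

    ollama pull llama3.1:8b      # downloads a quant sized for your machine
    ollama run llama3.1:8b       # interactive chat in the terminal
    # the same model is now available over the OpenAI-compatible API on localhost:
    curl http://localhost:11434/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "llama3.1:8b", "messages": [{"role": "user", "content": "Hello"}]}'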

  • What it does: downloads, manages, and serves open-weight LLMs with sane defaults.
  • What's free: all of it. No paid tier exists.
  • OS support: macOS, Linux, Windows.
  • Hardware floor: 8 GB RAM for tiny models; 6 GB+ VRAM for comfortable use.
  • What breaks first: Windows + AMD GPUs. ROCm support on Windows is improving, but many builds still fall back to CPU silently. If you're on Windows with an AMD GPU, see /errors before you blame yourself.

Ollama's honest weakness: it abstracts too much for power users. You don't directly pick the quant; you can't easily run two GPUs in tensor-parallel; advanced sampling parameters are buried. For 80% of users that's a feature. For the other 20%, drop down to llama.cpp directly.

LM Studio — the GUI for non-developers

LM Studio is what you give a friend who wants to try local AI but is not going to open a terminal. It's a single desktop app that includes a model browser (with quality / size labels), a chat window, a server mode for local APIs, and a hardware-aware quant recommender. Same llama.cpp engine under the hood as Ollama; different audience.

  • What it does: graphical model search, download, chat, and local API server.
  • What's free: personal use is free. Commercial use requires a separate license (paid).
  • OS support: macOS, Windows, Linux.
  • Hardware floor: same as Ollama.
  • What breaks first: the model browser sometimes lists quants that won't fit on the user's hardware. The hardware-fit estimate is approximate, not authoritative.
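
The server mode exposes the same style of OpenAI-compatible endpoint as Ollama. The sketch below assumes LM Studio's default port of 1234 and a model already loaded in the app; the exact address and model identifier are shown in the app's server tab, so treat these values as placeholders.

    curl http://localhost:1234/v1/models        # lists whatever model is currently loaded
    # "loaded-model-id" below is a placeholder; use an identifier from the list above
    curl http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "loaded-model-id", "messages": [{"role": "user", "content": "Hello"}]}'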

llama.cpp — the engine under everything

llama.cpp is the C/C++ inference engine that Ollama, LM Studio, GPT4All, Jan, KoboldCpp, and a dozen others embed under the hood. It's the reason any of this works on consumer hardware — Georgi Gerganov's team has done the unglamorous work of implementing every quantization kernel for every backend (CUDA, Metal, ROCm, Vulkan, SYCL, CPU AVX-512, ARM NEON).

  • What it does: raw inference of GGUF-format models with maximum hardware coverage.
  • What's free: MIT-licensed, all of it.
  • OS support: everything that has a C++ compiler.
  • Hardware floor: a Raspberry Pi 4 can run a 1B model at 2-3 tok/s. There is essentially no floor.
  • What breaks first: the user experience. You build it from source, you write your own command line, you handle your own model files. Most people should use Ollama or LM Studio and let llama.cpp work invisibly underneath.

Use llama.cpp directly when: you need a quantization the higher-level tools don't expose, you're embedding inference into your own application, or you're debugging why something doesn't work in the higher-level tool.
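
If you do go direct, a from-source build looks roughly like this. Binary names and build flags have shifted between releases, the repository has moved under the ggml-org organization (the old URL still redirects), and the GGUF path is a placeholder for whatever model file you supply.

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    cmake -B build                              # add -DGGML_CUDA=ON here for NVIDIA GPUs
    cmake --build build --config Release
    ./build/bin/llama-cli -m ./models/your-model.Q4_K_M.gguf -p "Hello" -n 128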

GPT4All — the all-in-one desktop app

GPT4All is Nomic AI's desktop app that bundles a runtime, a model browser, a chat UI, and a built-in document-Q&A feature into a single installer. It's the most “just works” option for users who want one app to do everything without combining four tools.

  • What it does: chat + local document RAG + model management in one app.
  • What's free: all of it for personal use.
  • OS support: macOS, Windows, Linux.
  • Hardware floor: 8 GB RAM, no GPU required (though slow without one).
  • What breaks first: the curated model list is smaller than Hugging Face's, so you may not find the latest checkpoint for a week or two after release. The document-RAG quality is functional but not as good as a dedicated tool like AnythingLLM.

Open WebUI — the ChatGPT-style frontend

Open WebUI is the closest free clone of the ChatGPT web UI. It runs as a web server on your machine, talks to Ollama (or any OpenAI-compatible endpoint), and gives you conversation history, multi-model comparison, file attachments, web search integration, and user accounts. Most operators who run Ollama eventually pair it with Open WebUI.

  • What it does: web-based chat frontend over local model servers.
  • What's free: all of it (BSD-3 licensed).
  • OS support: runs in Docker (any platform) or via pip (Linux/macOS); the browser is the client.
  • Hardware floor: the UI itself is light; the inference cost is whatever Ollama needs.
  • What breaks first: Docker disk usage. Open WebUI in Docker accumulates layers; docker system prune is your friend. See /systems/local-ai-maintenance.
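
A typical single-machine setup is one container pointed at the Ollama server already running on the host. The image name and flags below are lifted from the project's README as of this writing, so check there before copying; the named volume is part of what slowly eats disk alongside the image layers.

    docker run -d -p 3000:8080 \
      --add-host=host.docker.internal:host-gateway \
      -v open-webui:/app/backend/data \
      --name open-webui \
      ghcr.io/open-webui/open-webui:main
    # later, when Docker disk usage creeps up:
    docker system prune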

AnythingLLM — chat-with-your-documents

AnythingLLM is the “upload your PDFs / Notion export / website and ask questions” tool, fully local. It builds an embedding index, stores it in a local vector database, and runs retrieval-augmented generation against your chosen LLM. Workspaces let you keep different document sets separate.

  • What it does: local RAG over your own documents, with a polished UI.
  • What's free: the desktop app and self-hosted version are MIT-licensed.
  • OS support: macOS, Windows, Linux.
  • Hardware floor: 8 GB RAM works for small document sets; embedding large corpora benefits from a GPU.
  • What breaks first: CPU-only embedding of a large document set is slow; initial ingestion of 10K pages can take hours. Use a GPU for ingestion if you have one.
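
If you prefer the self-hosted version over the desktop app, a bare-bones Docker start looks roughly like this. The image name and port are what the project publishes at the time of writing, and the README's recommended volume mounts are omitted here, so add them if you want workspaces to survive a container restart.

    docker pull mintplexlabs/anythingllm
    docker run -d -p 3001:3001 --name anythingllm mintplexlabs/anythingllm
    # then open http://localhost:3001 and point a workspace at your local Ollama endpoint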

Continue.dev — the IDE coding assistant

Continue.dev is the open-source equivalent of GitHub Copilot or Cursor that runs against your local model. It plugs into VS Code and JetBrains IDEs, gives you in-line autocomplete, chat-with-codebase, and slash commands — all routed to your local Ollama instance or any OpenAI-compatible endpoint.

  • What it does: IDE autocomplete + chat backed by your local model.
  • What's free: open-source (Apache 2.0).
  • OS support: wherever VS Code or JetBrains run.
  • Hardware floor: autocomplete needs a small fast model (Qwen 2.5 Coder 1.5B or DeepSeek-Coder 6.7B). 6 GB+ VRAM strongly recommended for usable latency.
  • What breaks first: the autocomplete latency on CPU-only setups is too high to be useful. The chat side works fine on CPU; the inline completion does not.
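
Before wiring up the extension, have the models it will call already pulled and the server it will point at reachable. The pairing below is just one reasonable choice for the VRAM floor above; Continue's own config then references the same tags.

    ollama pull qwen2.5-coder:1.5b     # small, fast model for inline autocomplete
    ollama pull qwen2.5-coder:7b       # larger model for the chat side (example choice)
    curl http://localhost:11434/api/tags    # sanity check: the endpoint Continue will use is up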

Aider — the terminal pair-programmer

Aider is a terminal-based AI pair programmer. You point it at a git repository, describe what you want, and it edits files, runs your tests, and commits. Originally built for cloud APIs, it now works fluently with local models via Ollama or any OpenAI-compatible endpoint.

  • What it does: AI-driven multi-file edits with git integration, in the terminal.
  • What's free: open-source (Apache 2.0).
  • OS support: any platform with Python 3.10+.
  • Hardware floor: for usable agentic editing, 12 GB+ VRAM with a 14B+ coder model. Smaller models work but make more mistakes.
  • What breaks first: small local models (under 7B) struggle with Aider's diff-based edit format. They produce edits that don't cleanly apply, and you spend more time fixing than coding. Use a 14B+ coding-tuned model.
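
A local-model session looks roughly like this. The PyPI package is aider-chat, and the ollama_chat/ model prefix plus the OLLAMA_API_BASE variable follow Aider's documented local-model setup; double-check the docs for your version, and treat the model tag as an example.

    pip install aider-chat
    export OLLAMA_API_BASE=http://127.0.0.1:11434
    cd your-repo/
    aider --model ollama_chat/qwen2.5-coder:14b    # 14B+ coder model, per the hardware floor above
    # describe the change in the prompt; aider edits files and commits each change to git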

How they fit together

These eight tools are not competitors — they're a stack. The minimal useful combination for most people is:

  • Ollama as the inference server.
  • Open WebUI as the chat frontend.
  • Continue.dev for in-IDE coding.

That's ChatGPT-equivalent + Copilot-equivalent, fully local, $0 in software cost. Add AnythingLLM if you want to chat with your own documents. Swap to LM Studio if you don't want a terminal. Drop down to llama.cpp directly if you need fine control. Add Aider when you want agentic editing.

What breaks first

The same three categories of failure show up across all of these tools:

  1. Hardware doesn't match the model you picked. The runtime falls back to CPU or swaps to disk; tok/s drops to 1-3; you blame the tool. The fix is always the same: check the VRAM math at /will-it-run/custom first and pick the model second.
  2. Driver / runtime version drift. A Linux distro auto-updates the NVIDIA driver, or Windows Update bumps the WSL2 kernel, and suddenly the GPU isn't detected. See /errors for the standard fixes; the quick checks after this list confirm whether the GPU is actually in use.
  3. Quantization mismatch. The tool downloads a Q2 or Q3 quant by default to fit limited hardware; output quality is poor; the user assumes “local AI is bad.” Insist on Q4_K_M or higher whenever VRAM allows. See /systems/quantization-formats.
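
When output crawls, a couple of quick checks usually tell you which of the three you've hit. The commands below assume an NVIDIA GPU and an Ollama-based stack; translate as needed for other setups, and the model tag is just an example.

    nvidia-smi               # failure 2 check: is the GPU visible to the driver at all?
    ollama ps                # failure 1 check: is the loaded model on the GPU or spilling to CPU?
    ollama show llama3.1:8b  # failure 3 check: which quantization was actually downloaded?
    # rough VRAM math for failure 1: an 8B model at Q4_K_M is roughly 5 GB of weights,
    # plus KV cache and runtime overhead on top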

Adjacent reading: can my computer run any of this?, what hardware should I buy?, and the full /tools directory.

The tools themselves cost nothing, but they all consume the same finite resource — your GPU. A tool that demands 16 GB of VRAM to stay responsive does not care whether you paid for the software. Knowing which GPU tier each category of tool expects lets you filter the list to what your machine can actually drive at interactive speed instead of installing everything and hoping for the best.

The GPU tier that makes free tools actually usable: best budget GPU for local AI.