EXPLAINER · CONCEPTUAL · revision 01

Local AI, in simple terms

~5 minute read · for someone who has never thought about how AI works

People keep telling you that you can "run AI on your computer" now. You nod. You don't know what that means.

This page explains it. The vocabulary is simple. The analogies are real. By the end you should understand what the AI actually is and why running it on your laptop is even possible. After that, head to /start for the install instructions.

§ 1

What is an "AI model"

An AI model is a file. That's the surprising part. It's not a website, not a service, not a black box — just a file sitting on a hard drive. You can copy it, email it, put it on a USB stick. It's usually somewhere between 2 and 80 gigabytes.

Inside the file is a recipe for predicting the next word. The recipe was written automatically by a separate program (called "training") that read a substantial chunk of the public internet — books, Wikipedia, code repositories, scientific papers, forum posts — and figured out the statistical patterns of how humans use language.

The recipe takes the form of about 8 billion numbers (or 70 billion, or 405 billion, depending on how big a model you want). When you ask the model a question, your computer feeds your words through those numbers in a precise sequence. Out the other end comes the most likely next word. Then the next. Then the next. Eventually you have a complete answer.

That's it. There is no magic. The model never "knows" anything in the way you know your name. It does an extremely sophisticated form of autocomplete, trained on so much text that the autocomplete is often indistinguishable from real reasoning.
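The "sophisticated autocomplete" loop above can be sketched in a few lines. This is a toy, not a real model: the word table and its probabilities are made up for illustration, and a real model computes these probabilities from billions of numbers instead of looking them up. But the loop itself — pick the most likely next word, append it, repeat — is the same shape.

```python
# A toy "model": for each word, the words that tend to follow it.
# In a real model these probabilities come from training on huge
# amounts of text; here they are invented for illustration.
next_word_table = {
    "the": {"cat": 0.5, "dog": 0.3, "sky": 0.2},
    "cat": {"sat": 0.6, "ran": 0.4},
    "dog": {"ran": 0.7, "sat": 0.3},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def most_likely_next(word):
    """Pick the single most probable next word (greedy decoding)."""
    candidates = next_word_table.get(word)
    if not candidates:
        return None  # nothing learned for this word: stop
    return max(candidates, key=candidates.get)

def autocomplete(start, max_words=5):
    """Generate text one word at a time, feeding each output back in."""
    words = [start]
    for _ in range(max_words):
        nxt = most_likely_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)

print(autocomplete("the"))  # → "the cat sat down"
```

Notice that the loop feeds its own output back in as input. That one detail is why a model can write a whole paragraph from a single prompt.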

§ 2

ChatGPT is a restaurant. Local AI is your kitchen.

When you use ChatGPT or Claude, you're ordering from a restaurant. Their chefs (the model files) live in a kitchen (datacenter) you don't own. You send your order over the internet. They cook it. They send it back. You pay per meal. They write down what you ordered.

Local AI is cooking the same meal in your own kitchen. You downloaded the recipe (the model file). You bought the ingredients (a laptop with enough RAM). You can cook anytime, even with the wifi off. Nobody knows what you're making. You don't pay per meal — only the up-front cost of the kitchen.

The restaurant has a bigger, fancier kitchen than yours. They have recipes you can't download. For special occasions, the restaurant is still better. But for a Tuesday dinner, your kitchen is fine, and it's yours.

§ 3

Why this works on a normal computer

Two things changed in 2023–2024 that made local AI feasible.

First, models got smaller without getting much dumber. Researchers figured out how to train smaller models on smarter data — using better recipes — so an 8-billion-parameter model from 2025 is roughly as capable as a 175-billion-parameter model from 2022. That's roughly a 20× efficiency gain in three years.

Second, quantization became standard. A model file is mostly numbers, and the numbers don't need full precision to work. Storing each number with 4 bits instead of 16 shrinks the file by 4× while losing only 1–2% of accuracy. A 70B model that used to need a $20,000 datacenter card now fits on a consumer laptop.

The combined effect: a free model you can download today produces answers that would have required a billion-dollar research lab five years ago. Most of the cost moved to the up-front training run, which a few large organizations pay once. After that the model file is just bytes anyone can copy.
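The quantization arithmetic is simple enough to check yourself. This back-of-envelope sketch ignores small overheads (tokenizer, metadata) and just counts the numbers: parameters × bits per number ÷ 8 gives bytes.

```python
def model_size_gb(parameters, bits_per_number):
    """File size: each parameter is one number stored in that many bits."""
    bytes_total = parameters * bits_per_number / 8
    return bytes_total / 1e9  # gigabytes

# A 70-billion-parameter model at full 16-bit precision...
full = model_size_gb(70e9, 16)  # 140 GB: datacenter territory
# ...and the same model quantized to 4 bits per number.
q4 = model_size_gb(70e9, 4)     # 35 GB: fits a well-equipped laptop

print(f"FP16: {full:.0f} GB, Q4: {q4:.0f} GB, shrink: {full / q4:.0f}x")
```

Run it and the 4× shrink falls straight out of the arithmetic: 140 GB down to 35 GB, which is the difference between "needs a datacenter card" and "fits on a machine you can buy."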

§ 4

Vocabulary you'll hear

These come up constantly. One-line analogies for each. The /glossary has the technical detail; this is just enough to follow a conversation.

parameters
The number of dials inside the model. More dials usually means better answers. "8B" means 8 billion dials. The model is named after this number.
token
A chunk of text the model thinks in. Roughly ¾ of a word in English. The model produces text one token at a time. When someone says "tokens per second," that's the speed.
context window
How much text the model can see at once. 128K tokens ≈ a 300-page book. Beyond that, the model forgets the start of the conversation.
quantization
Storing the model's numbers in less space. Q4 is 4 bits per number, FP8 is 8 bits, FP16 is 16. Smaller = faster + less memory + slightly worse answers.
VRAM
The fast memory inside a graphics card. The model has to fit in here for the GPU to actually accelerate it. 24 GB VRAM is the sweet spot for serious local work.
runtime
The program that actually reads the model file and runs it. Ollama, llama.cpp, LM Studio, vLLM are the popular ones. Same model file, different runtimes — pick whichever is easiest.
fine-tuning
Taking an existing model and teaching it a few thousand more things specific to your work. Like hiring a great chef and showing them your favorite recipes.
RAG
Short for retrieval-augmented generation: letting the model read your documents at question time, instead of trying to teach it about them ahead of time. The way to make a generic model answer questions about your specific data.
agent
A model in a loop, given the ability to use tools (read files, run commands, browse the web). Instead of one answer, you get many small actions toward a larger goal.
hallucination
When the model makes something up that sounds correct. All models do this. The trick is to verify anything the model says that matters.
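The token and context-window numbers above convert into everyday units with one multiplication. The ¾-word-per-token figure comes from the list above; the 320-words-per-page figure is an assumption about a typical paperback page.

```python
WORDS_PER_TOKEN = 0.75  # rough English average, as above
WORDS_PER_PAGE = 320    # typical paperback page (assumption)

def tokens_to_pages(tokens):
    """Convert a context-window size into an everyday unit: book pages."""
    return tokens * WORDS_PER_TOKEN / WORDS_PER_PAGE

print(f"128K context ≈ {tokens_to_pages(128_000):.0f} pages")

# Speed works the same way: 20 tokens/second is 15 words/second,
# far faster than anyone reads.
print(f"20 tok/s ≈ {20 * WORDS_PER_TOKEN:.0f} words/s")
```

So a 128K context window really is about a 300-page book, and a model producing 20 tokens per second is writing faster than you can follow.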

§ 5

Should you trust what it says?

No. Not without checking.

The model is doing autocomplete on patterns it learned from text. It will produce things that sound right but are wrong, including fake citations, made-up function signatures, dates that don't match reality, and confidently stated facts that are just fabrication. There's no internal fact-checker. The model doesn't know what it doesn't know.

The bigger and more recent the model, the less this happens — but it never goes to zero. Treat outputs as a strong first draft. Verify anything that matters: code by running it, facts by checking primary sources, advice by asking a person who knows.

This applies equally to ChatGPT, Claude, Gemini, and the model you just downloaded. Local versus cloud doesn't change the honesty problem; it's a property of how all current models work.

§ 6

Now what

If you have a laptop with 16 GB of RAM or more, you can run a competent model right now. The setup takes about five minutes. See /start.

If you want to know which model to pick, see /models. If you're trying to figure out what hardware to buy, see /hardware. If you want to know what local AI can do at all, see /tasks. If you have a specific machine and want to know what runs on it, see /will-it-run.

If you want to know the technical detail behind any term, see /glossary. If you want to keep up with what's changing, see /pulse.

EXPLAINER · LAST REVISED 2026-05 · runlocalai.co/explained