Beginner: run your first local model
For: Anyone with a Mac or Windows laptop and zero local-AI experience. By the end: A 7B-class model running locally, an OpenAI-compatible endpoint pointed at it, and the vocabulary to read benchmark pages.
You're on a Windows or Mac laptop, you've heard about local AI, and you don't have a 4090 sitting in your closet. That's fine — the goal of this path is to get a small, real model running on what you already own, then teach you the vocabulary you need to decide whether to invest in more hardware. By milestone seven you will have a 7B-class model running through an OpenAI-compatible endpoint, the ability to read a benchmark page without panic, and a working sense of which quantization to pick.
Install Ollama and load a 3B model
Ollama is the easiest first install. One binary, one command to pull a model, and a working REPL. On a 16GB laptop you can comfortably fit a 3B-class model like Llama 3.2 3B or Qwen 2.5 3B; these are not toys: they handle simple summarization, classification, and small Q&A reliably.
Don't reach for the 70B model. Don't try Mixtral. The point of this milestone is to get a successful end-to-end run and confirm your machine is capable of running anything at all. If Ollama itself can't load a 3B model in Q4, the rest of the path will not save you.
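Beyond the REPL, Ollama also serves a small HTTP API on localhost:11434, which is handy once you want to script against it. A minimal sketch of composing a one-shot request with the standard library (the model tag `llama3.2:3b` is an assumption; use whatever tag you actually pulled):

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local API

def build_generate_request(model: str, prompt: str) -> request.Request:
    """Compose a one-shot (non-streaming) generation request for Ollama."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return request.Request(OLLAMA_URL, data=body,
                           headers={"Content-Type": "application/json"})

# With `ollama pull llama3.2:3b` done and the server running, sending the
# request would return the completion:
#   resp = request.urlopen(build_generate_request("llama3.2:3b", "Say hi."))
#   print(json.loads(resp.read())["response"])
```

The function only builds the request, so you can inspect exactly what goes over the wire before you send it.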
Read your first benchmark page honestly
You will spend the rest of your local-AI life reading benchmark pages and you should learn to read them now. The two things to internalize: tokens-per-second is workload-dependent (prompt size, batch size, quantization, runtime), and a "B" in the model name is parameter count, not memory. A 7B model in FP16 needs ~14GB; in Q4 it needs ~4GB. Same model, totally different deployment story.
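The parameter-to-memory arithmetic is simple enough to sketch. This is a back-of-envelope estimate for the weights alone; it ignores KV cache and runtime overhead, which is why you budget extra headroom later:

```python
def model_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight-memory estimate: parameter count times bits per weight.

    Ignores KV cache and runtime overhead -- budget a few extra GB for those.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# The numbers from the text:
print(round(model_memory_gb(7, 16), 1))   # FP16: 14.0 GB
print(round(model_memory_gb(7, 4.5), 1))  # Q4-ish (~4.5 bits/weight): 3.9 GB
```

Same 7B model, a 3-4x difference in memory footprint, which is the whole reason quantization matters.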
Before continuing, browse three benchmark pages on /benchmarks and identify the runtime, the quantization, and the hardware on each. If you can't, re-read this milestone. Don't skip — every later milestone assumes you can.
Move up to a 7B model and feel the cost
Now pull a 7B-class model: Llama 3.1 8B Instruct or Qwen 2.5 7B. Watch what happens when you ask it for a 500-word answer. On most laptops without a discrete GPU, you will see something between 5 and 15 tokens per second. That's slow enough to feel, fast enough to use for small focused tasks.
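To get a feel for what those rates mean in wall-clock time, here is a rough conversion. The 1.3 tokens-per-word figure is an assumption (a typical average for English text), not a property of any specific model:

```python
def answer_seconds(words: int, tokens_per_second: float,
                   tokens_per_word: float = 1.3) -> float:
    """Rough generation time for an answer of a given word count.

    English averages roughly 1.3 tokens per word, so a 500-word answer
    is ~650 generated tokens.
    """
    return words * tokens_per_word / tokens_per_second

print(round(answer_seconds(500, 8)))   # ~81 s at 8 tok/s
print(round(answer_seconds(500, 15)))  # ~43 s at 15 tok/s
```

A minute-plus wait for a long answer is what "slow enough to feel" means in practice; a two-sentence classification at the same rate comes back in a few seconds.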
This is the milestone where most people decide whether local AI is for them. If 8 tok/s is fine for what you want to do (drafting, classification, code review on small files), you have your answer. If it isn't, you have learned a real fact about your hardware.
Understand quantization without the math
Every model on Hugging Face ships in multiple quantized variants — Q4_K_M, Q5_K_M, Q6_K, Q8_0, FP16. These are quantization formats and they trade memory for accuracy. Don't memorize the formulas; memorize the heuristic: start at Q4_K_M, move up to Q5_K_M if outputs feel dumb, move up to Q8 if you have the memory and accuracy matters.
The VRAM calculator gives you the math when you need it. For now, the rule is: pick the largest quant that leaves you 2-4GB of headroom on top of the model itself.
Switch from Ollama to llama.cpp directly
Ollama is a wrapper around llama.cpp. Once you've gotten comfortable, learning to run llama.cpp directly buys you real control: you can pick exact build flags, point at quants Ollama doesn't ship, and read meaningful error messages when things break. This is the upgrade from "operator who can run a model" to "operator who can debug their stack."
The OpenAI-compatible HTTP server on port 8080 is the interface. Every editor extension, every agent, every framework speaks that protocol. You are now plug-compatible with the rest of the local-AI ecosystem.
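Because the protocol is OpenAI's, any client that speaks it works unchanged; the official `openai` Python package works by just pointing `base_url` at your server. A dependency-free sketch of the same request, assuming llama.cpp's server on its default port 8080:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080/v1"  # llama.cpp server's OpenAI-style API

def chat_request(model: str, user_message: str) -> request.Request:
    """Compose an OpenAI-style /v1/chat/completions request for the local server."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode()
    return request.Request(f"{BASE_URL}/chat/completions", data=body,
                           headers={"Content-Type": "application/json"})

# With the server running, sending it would return a standard completion:
#   resp = request.urlopen(chat_request("local", "Summarize this file."))
#   print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Swap `BASE_URL` for any other OpenAI-compatible server and nothing else changes; that interchangeability is the point of the protocol.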
Connect a real client to your local endpoint
Pick one front end. Open WebUI gives you a ChatGPT-style web app pointed at your local server. LM Studio gives you a polished desktop app with a model browser. Both work; pick the one that fits your habits. The point isn't the front end — it's that you've now run an end-to-end local AI stack and used it for something real.
Decide what's next, with eyes open
You've now run small models on what you have. You know what 8 tok/s feels like, you know which use cases your hardware can serve, and you know what the upgrade options are. If you want bigger models or higher throughput, the next path is hardware-shaped: a discrete GPU (the GPU chooser is the right next stop) or a higher-memory Mac.
Or: stop here. The 7B model on your laptop is a real tool for plenty of tasks. Knowing when to stop is its own operator skill.
Next recommended step
Now that you've felt your laptop's ceiling, the chooser walks you through picking a GPU based on the workload you've actually validated.