Common local AI mistakes — operations edition
The general operations counterpart to the benchmarking-mistakes list. Twelve mistakes operators repeatedly make running local AI in 2026: wrong model on wrong VRAM, no quantization understanding, ignored context budget, untested tok/s, FP16 vs Q4 mixing, treating estimates as benchmarks, sampler chaos, unpinned runtime versions, no observability, forgotten power cost, frontier-quality expectations from 7B, and missing safety constraints. Each with the actually-correct alternative.
Scope split — what this list is and isn't
This guide is the general-operations companion to two existing pieces. /guides/local-ai-benchmarking-mistakes covers measurement-specific anti-patterns — batch mismatch, quant mismatch, cherry-picking, cohort drift. /guides/common-local-ai-setup-mistakes covers initial-deployment errors at install time. This list covers everything else: the day-to-day operational mistakes that show up after you've already shipped a working local AI stack and are running it in 2026 with users on it. Different scope, deliberately non-overlapping.
1. Running 70B Q4 on a 12 GB card
The single most common “why is this so slow” thread in operator forums. A 70B model at Q4_K_M is roughly 40 GB of weights plus 4-8 GB of KV cache at modest context — call it 45 GB total in the comfortable zone. A 12 GB card holds none of that meaningfully. What happens: llama.cpp helpfully splits layers between GPU and CPU, you load successfully, and then generation drops from 40 tok/s to 2-3 tok/s as weights stream from system RAM. Users assume the model is broken; the model is fine, the hardware is undersized.
Do this instead. Run /will-it-run before you download. The 12 GB card's comfortable ceiling is 14B-class at Q4 with modest context. If you want 70B-class, the floor is roughly 48 GB: a single 48 GB card, or two 24 GB cards via tensor parallelism. The shopping framing is at /guides/best-gpu-for-local-ai-2026.
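A back-of-envelope fit check, as a sketch. The bits-per-weight figure and the KV-cache and overhead allowances below are assumptions drawn from the 70B example above, not measurements; /will-it-run encodes this math more carefully.

```python
# Rough fit check before downloading. Figures are estimates, not measurements;
# effective bits/weight for Q4_K_M varies by model, ~4.8 is a rule of thumb.
def weights_gb(params_b: float, bits_per_weight: float = 4.8) -> float:
    """Approximate on-GPU size of quantized weights in GB."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

def fits(vram_gb: float, params_b: float, kv_gb: float,
         overhead_gb: float = 1.5) -> bool:
    """True if weights + KV cache + runtime overhead fit in VRAM."""
    return weights_gb(params_b) + kv_gb + overhead_gb <= vram_gb

print(round(weights_gb(70), 1))                        # ~42 GB of weights alone
print(fits(vram_gb=12, params_b=70, kv_gb=5.0))        # False: the 12 GB card above
print(fits(vram_gb=12, params_b=14, kv_gb=1.5))        # True: 14B-class at modest context
```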
2. No mental model of quantization
Operators routinely treat “Q4” as a single thing, then are surprised when Q4_K_M, AWQ-INT4, GPTQ-int4, and EXL2-4.0bpw produce different sizes, different speeds, and different quality. They aren't the same scheme; they share a bit-width approximation and nothing else. Q4_K_M uses mixed precision per layer, AWQ uses different kernel paths, GPTQ has its own calibration approach, EXL2 has variable bits-per-weight per layer. Mixing them up gets you wrong size estimates, wrong VRAM forecasts, and wrong throughput expectations.
Do this instead. Build the mental model once. The reference is /glossary/quantization and the deeper systems treatment is /systems/quantization-formats. Learn the format your runtime ships — GGUF for llama.cpp/Ollama, AWQ/GPTQ for vLLM, EXL2 for ExLlamaV2, MLX-quants for Apple Silicon — and stop generalizing across them.
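A rough map of the families, as a sketch. The runtime pairings follow the paragraph above; the effective bits-per-weight values are rule-of-thumb approximations, not format specs, and real file sizes vary by architecture.

```python
# Quant family -> typical runtime, plus approximate effective bits/weight.
# The bpw values are rules of thumb (scales, zeros, and mixed layers push
# most "4-bit" schemes above 4.0), not published format specifications.
QUANT_FAMILIES = {
    "GGUF Q4_K_M":  {"runtime": "llama.cpp / Ollama",   "approx_bpw": 4.8},
    "AWQ-INT4":     {"runtime": "vLLM",                  "approx_bpw": 4.25},
    "GPTQ-int4":    {"runtime": "vLLM",                  "approx_bpw": 4.25},
    "EXL2-4.0bpw":  {"runtime": "ExLlamaV2",             "approx_bpw": 4.0},
    "MLX 4-bit":    {"runtime": "MLX (Apple Silicon)",   "approx_bpw": 4.5},
}

def approx_size_gb(params_b: float, bpw: float) -> float:
    """Weights only; KV cache and runtime overhead come on top."""
    return params_b * 1e9 * bpw / 8 / 1e9

for name, info in QUANT_FAMILIES.items():
    gb = approx_size_gb(70, info["approx_bpw"])
    print(f"{name:13s} ({info['runtime']:22s}) ~{gb:.0f} GB for a 70B model")
```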
3. Ignoring context-window cost
“Llama 3.1 8B supports 128K context” and “you can use 128K context on this hardware” are different statements. KV cache memory grows linearly with context length: roughly 320 MB per 1K tokens of context for a 70B FP16 model, much smaller for an 8B GQA model but still meaningful. A 48 GB rig running a 70B Q4 model with 16K context spends ~5 GB on KV cache; expanding to 64K context expands KV cache to ~20 GB, which no longer leaves room for the weights. The advertised window is a ceiling, not the working window.
Do this instead. Calculate the working window before you set --ctx-size. The KV cache formula is published; use it. Or use /will-it-run/custom which encodes the math. If you need long context on consumer hardware, look at FP8 or INT4 KV-cache quantization (vLLM and SGLang both support this) before you assume your 24 GB card can hold the published context.
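A minimal KV-cache calculator, assuming Llama-3.x-70B-style dimensions (80 layers, 8 KV heads under GQA, head dimension 128, FP16 cache); substitute your model's config values.

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem.
# Defaults below are Llama-3.x-70B-style assumptions, not universal constants.
def kv_cache_gb(context_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_tokens / 1e9

print(round(kv_cache_gb(16_000), 1))                        # ~5.2 GB at 16K context
print(round(kv_cache_gb(64_000), 1))                        # ~21 GB at 64K context
print(round(kv_cache_gb(64_000, bytes_per_elem=1), 1))      # ~10.5 GB with FP8 KV cache
```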
4. Not measuring tok/s before committing to hardware
The shopping mistake. Operators read a benchmark blog claiming “the 4090 hits 50 tok/s on Llama 3.3 70B” and buy a 4090 expecting that number. They don't notice the benchmark used Q3_K_S (smaller and faster than the Q4 they wanted), used a 200-token prompt (TTFT looks great), and reported single-stream burst (their multi-user serving will not see this). They get the card, see 25 tok/s on their actual workload, and conclude the hardware is broken.
Do this instead. Measure the workload you actually plan to run before you commit. Rent the GPU you're considering for a few hours on Lambda, RunPod, or vast.ai; run your actual model, your actual prompts, your actual concurrency level. The cost is $5-20 and saves you from a $2,000 mistake. The methodology is at /guides/how-to-benchmark-local-ai.
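A minimal throughput probe, as a sketch, assuming the rented box exposes an OpenAI-compatible completions endpoint (vLLM and llama.cpp's server both do). The URL, model name, and prompt are placeholders, and counting one token per streamed chunk is an approximation; run your real prompts at your real concurrency, not this toy.

```python
# Measure TTFT and decode tok/s against a streaming OpenAI-compatible endpoint.
import time
import requests

URL = "http://localhost:8000/v1/completions"          # placeholder endpoint
payload = {"model": "llama-3.3-70b-q4_k_m",           # placeholder model name
           "prompt": "Explain KV caching in two paragraphs.",
           "max_tokens": 512, "stream": True}

start = time.perf_counter()
ttft, tokens = None, 0
with requests.post(URL, json=payload, stream=True, timeout=600) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        if line[len(b"data: "):] == b"[DONE]":
            break
        if ttft is None:
            ttft = time.perf_counter() - start        # time to first streamed chunk
        tokens += 1                                   # ~one token per chunk (approximation)

elapsed = time.perf_counter() - start
if ttft is not None and tokens > 1:
    print(f"TTFT {ttft:.2f}s, decode {(tokens - 1) / (elapsed - ttft):.1f} tok/s")
```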
5. Mixing FP16 and quantized comparisons
Comparing “the 4090 does 38 tok/s on Llama 3.3 70B” with someone else's “the 4090 does 12 tok/s on Llama 3.3 70B” without noticing one was Q4_K_M and the other was FP16-with-CPU-offload. Both numbers are correct for their cohorts. They are not the same workload. The difference is roughly 3.2× in throughput, which the writer who omitted “Q4_K_M” from their tweet is implicitly hiding.
Do this instead. When you read a tok/s number, look for the quant. If it isn't printed, treat the number as unverified. When you publish, always print the full quantization tag: not “Q4” but Q4_K_M (GGUF, llama.cpp b4120). The negative-space companion list is /guides/local-ai-benchmarking-mistakes.
6. Treating estimates as benchmarks
Operator math is full of useful estimates: “a 70B Q4 needs ~40 GB of VRAM,” “a 4090 should hit ~30 tok/s on this,” “TTFT should be under a second.” These are estimates. They are useful for planning. They are not measurements. The mistake is letting the estimate harden into a published benchmark and then defending it when reality differs by 30%.
Do this instead. Label estimates as estimates. If you publish a number that came from theoretical calculation rather than a measured run, say so. If you publish a measured number, give the runtime version, the driver, the prompt, the output token count, the median and stddev across runs. The reproduction-guide template is at /resources/reproduction-guide; the cohort-versioning approach at /resources/versioned-benchmarking.
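One way to keep the two apart is to make the measurement record carry its cohort explicitly. The field names and values below are illustrative, not a schema this site prescribes.

```python
# A measured number travels with its cohort (quant, runtime build, driver,
# prompt shape) and its spread across runs; an estimate is labeled as one.
import statistics

runs_tok_s = [28.4, 29.1, 27.8, 28.9, 28.6]   # five measured runs, not one burst

record = {
    "metric": "decode tok/s",
    "median": round(statistics.median(runs_tok_s), 1),
    "stddev": round(statistics.stdev(runs_tok_s), 2),
    "runs": len(runs_tok_s),
    "quant": "Q4_K_M (GGUF)",
    "runtime": "llama.cpp b4120",     # pin the build, not "latest"
    "driver": "<your driver version>",
    "prompt_tokens": 1024,
    "output_tokens": 512,
    "kind": "measured",               # vs "estimate"; never conflate the two
}
print(record)
```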
7. Mismanaging sampler params
Local model output quality is notoriously sensitive to sampler choice. Temperature at 1.0 looks creative on Claude and chaotic on Qwen2.5; top-p at 0.9 looks fine for general chat and falls apart for code generation. Operators often run with whatever defaults the runtime ships and conclude the model is bad when the model is fine and the sampler is wrong for the task.
Do this instead. Tune samplers per task. For code generation start at temperature 0.2-0.4, top-p 0.95, repetition penalty 1.05. For creative writing temperature 0.7-0.9, top-p 0.92. For factual Q&A temperature 0.1-0.3. For agent tool-use temperature 0.0-0.2 (you want determinism, not creativity). Document the sampler config alongside the prompt; treat sampler changes the way you'd treat code changes.
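The presets from the paragraph above, written down as config rather than folklore; the exact values are starting points to tune per model, not gospel, and the parameter names follow common runtime conventions.

```python
# Per-task sampler presets. Treat changes to this dict like code changes:
# reviewed, versioned, and documented alongside the prompt.
SAMPLER_PRESETS = {
    "code":     {"temperature": 0.3, "top_p": 0.95, "repeat_penalty": 1.05},
    "creative": {"temperature": 0.8, "top_p": 0.92},
    "factual":  {"temperature": 0.2, "top_p": 0.9},
    "agent":    {"temperature": 0.0, "top_p": 1.0},   # determinism over creativity
}

def sampler_for(task: str) -> dict:
    """Fail loudly on unknown tasks instead of silently using runtime defaults."""
    return SAMPLER_PRESETS[task]
```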
8. Not pinning runtime versions
A local AI rig that worked yesterday and doesn't today, with no application-code change, is almost always a runtime upgrade. Ollama auto-updates. llama.cpp moves fast and ships breaking sampler changes between builds. vLLM has had multiple releases that changed default behavior in ways that surprised the operator. Driver updates change tok/s by 5-15% in either direction overnight.
Do this instead. Pin everything: the runtime version, the model file checksum, the driver version, the kernel version. Document them in a single file checked into source control. When something regresses, you have a known-good config to roll back to. The maintenance posture is detailed at /systems/local-ai-maintenance.
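A sketch of that one file, assuming an NVIDIA box with nvidia-smi on the PATH; the model path and build tag are examples, and the resulting pins.json belongs in source control.

```python
# Capture the things that silently change under you: model checksum,
# runtime build, driver, kernel. Adapt paths and commands to your rig.
import hashlib
import json
import pathlib
import subprocess

def sha256(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while blk := f.read(chunk):
            h.update(blk)
    return h.hexdigest()

MODEL = "models/llama-3.3-70b-q4_k_m.gguf"   # example path

pins = {
    "model_file": MODEL,
    "model_sha256": sha256(MODEL),
    "runtime": "llama.cpp b4120",            # example build tag -- record yours
    "nvidia_driver": subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True).stdout.strip(),
    "kernel": subprocess.run(["uname", "-r"],
                             capture_output=True, text=True).stdout.strip(),
}
pathlib.Path("pins.json").write_text(json.dumps(pins, indent=2))
```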
9. Running blind — no observability
Operators with a working local AI rig often have zero visibility into what's happening inside it. No request logs. No tok/s tracking over time. No GPU utilization graph. No alerting when generation latency degrades. They notice problems only when a user complains, and the time-to-diagnose is whatever fits between the complaint and the next complaint.
Do this instead. Bare-minimum observability: log every request with prompt-length, output-length, latency, and tok/s. Track GPU utilization and VRAM headroom in Prometheus or whatever you have. Add an alert when median tok/s drops more than 20% from baseline — that catches driver regressions, model swaps, and thermal events. Full systems framing at /systems/local-ai-observability.
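The bare-minimum shape of that logging plus the 20% alert, as a sketch; the baseline value, window size, and alerting backend are yours to choose.

```python
# Per-request log line plus a rolling-window median check against baseline.
import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("inference")

BASELINE_TOK_S = 28.0          # set from a known-good week, not a guess
recent: list[float] = []       # rolling window of per-request decode tok/s

def record_request(prompt_tokens: int, output_tokens: int, latency_s: float) -> None:
    tok_s = output_tokens / latency_s if latency_s > 0 else 0.0
    log.info("prompt=%d output=%d latency=%.2fs tok_s=%.1f",
             prompt_tokens, output_tokens, latency_s, tok_s)
    recent.append(tok_s)
    if len(recent) >= 50:                              # evaluate on a window, not one request
        if statistics.median(recent) < 0.8 * BASELINE_TOK_S:
            log.warning("median tok/s dropped more than 20 percent from baseline; "
                        "check driver, model file, thermals")
        recent.clear()
```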
10. Forgetting electricity cost
Most local AI cost calculations stop at the GPU price and the cloud-equivalent comparison. They forget electricity. An RTX 4090 at sustained inference draws 350-450W; a workstation system around it adds 100-200W more. At 12 hours/day average use and $0.15/kWh, that's $25-35/month — $300-420/year. At higher residential rates ($0.20-0.30/kWh in some metros), it's $40-70/month. Over three years that's $900-2,500 of electricity that wasn't in the spreadsheet.
Do this instead. Include electricity in TCO. The methodology is at /guides/how-much-does-local-ai-cost; the savings comparison at /guides/does-running-ai-locally-save-money. If your local-AI-saves-money story doesn't survive a $0.15/kWh assumption, the math was always wrong.
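The arithmetic, as a sketch; the wattage and tariffs below are assumptions, so measure draw at the wall and plug in your own rate.

```python
# Electricity line item for the TCO spreadsheet.
def monthly_power_cost(watts: float, hours_per_day: float, usd_per_kwh: float) -> float:
    kwh_per_month = watts / 1000 * hours_per_day * 30
    return kwh_per_month * usd_per_kwh

# ~550W rig (GPU plus host), 12 h/day, at two example tariffs
print(round(monthly_power_cost(550, 12, 0.15), 2))    # ~$29.70/month
print(round(monthly_power_cost(550, 12, 0.28), 2))    # ~$55.44/month
print(round(monthly_power_cost(550, 12, 0.15) * 36))  # ~$1069 over three years
```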
11. Expecting frontier quality from a 7B
The conviction that “the open-weight 7B is just as good as GPT-4” persists despite four years of evidence to the contrary. In 2026 the gap between a 7B-class local model and frontier cloud models on hard reasoning, complex code, and long-form synthesis is large and well-documented. 32B closes some of it; 70B closes more; 100B+ MoE closes most of it. None of them close all of it on every task.
Do this instead. Match the model to the task honestly. A 7B is excellent for autocomplete, summarization, single-turn chat, classification. It is not excellent for multi-step reasoning or hard coding. If your evaluation finds 7B sufficient for your real workload, great — ship it. If it isn't, the answer isn't to keep prompt-engineering the 7B; it's to use the right tool. Sometimes that tool is a bigger local model; sometimes it's a cloud API; sometimes it's a hybrid.
12. Forgetting safety constraints
Local models don't come with the same guardrails cloud APIs do. When an operator builds a customer-facing local-AI product without safety constraints, they ship a model that will happily generate content the cloud APIs would refuse, with no friction layer. This isn't a model defect — it's the operator's responsibility — but it routinely surprises operators who've only ever shipped on top of a moderated API.
Do this instead. Layer the safety yourself. Input filtering on the prompt (regex for the common abuse patterns; a small classifier for harder cases). Output filtering on the generation. System-prompt-level constraints with red-team testing. Monitoring for jailbreak attempts. The local model is a raw capability; productizing it for end users requires the same safety surface that cloud providers build, just owned by you. The systems framing is at /systems/local-ai-security.
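A minimal sketch of that input/output layer, assuming you supply your own classifier; the regex patterns are placeholder examples of common abuse tells, nowhere near real coverage.

```python
# Operator-owned safety surface: regex prefilter on the prompt, a hook for a
# classifier, and a filter on the generation before it reaches the user.
import re

BLOCK_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.I),           # jailbreak tell
    re.compile(r"\bsocial security number\b.*\d{3}-\d{2}-\d{4}", re.I) # PII probe
]

def prefilter(prompt: str) -> bool:
    """True if the prompt should be blocked before it reaches the model."""
    return any(p.search(prompt) for p in BLOCK_PATTERNS)

def postfilter(completion: str, classifier=None) -> str:
    """Run the generation through your own safety classifier before returning it."""
    if classifier is not None and classifier(completion):
        return "[response withheld by policy]"
    return completion
```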
The pattern under all twelve
Each mistake in this list is the same shape: an operator extrapolating from a partial mental model into a confident decision, then being surprised when reality refuses to comply. The fix is the same in every case — slow down, verify the claim before committing money or hours, and write down the assumptions so future-you can audit them. Local AI rewards operators who measure carefully and punishes operators who guess.
Next recommended step
The measurement-specific anti-pattern list this guide cross-links throughout: /guides/local-ai-benchmarking-mistakes.
Many first-time local AI builders spec a machine around a model they read about on a forum, only to discover the model has been superseded by a faster variant that needs different quantization settings. Hardware stays with you across dozens of model releases. A GPU chosen for flexibility today means you do not have to rebuild your stack every quarter when the next model architecture ships.
The hardware picks that avoid the most common regret: best GPU for Llama.
A card with adequate VRAM headroom also sidesteps the most common runtime error on this site: CUDA out-of-memory crashes during long context windows.