Questions
23 direct answers to the questions Reddit, HN, and forum operators actually ask. One paragraph each. Source-cited. Linked to the live tool that computes the answer for your specific stack.
Each entry is a short-link landing page targeted at a specific forum-search query. The format is rigid by design: a one-paragraph answer that doesn't hedge beyond what the data warrants, a source-of-record citation, a deep-link into the tool that computes the answer for your stack, and cross-references to related editorial. We update these pages as new data lands.
Fine-Tuning
- Is fine-tuning dead in 2026? RAG vs distillation vs prompting — when does fine-tuning actually win?
  Fine-tuning is NOT dead in 2026 — but the cases where it wins are narrower than in 2023-2024. The honest framework: the prompt → RAG → distillation → fine-tune ladder.
  Tags: fine-tuning, distillation, rag, deepseek-r1
- Should I fine-tune, or just use a better prompt?
  Fine-tuning vs prompting vs RAG decision framework — when each one wins, when each one wastes money, and the 3-question test that tells you which to pick.
  Tags: fine-tuning, rag, prompting, decision-framework
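The ladder above can be sketched as a decision function. This is a minimal illustration only: the field names and the specific questions are hypothetical stand-ins for the kind of test the linked pages describe, not the site's actual rubric.

```python
# Illustrative sketch of a prompt -> RAG -> distillation -> fine-tune ladder.
# All questions below are assumed/hypothetical, not the site's actual 3-question test.
from dataclasses import dataclass

@dataclass
class Workload:
    prompt_fits_behavior: bool         # can a better prompt alone hit the quality bar?
    needs_fresh_or_private_data: bool  # do answers depend on documents the model hasn't seen?
    has_labeled_examples: bool         # are thousands of input/output pairs available?
    latency_or_cost_bound: bool        # must this run on a smaller/cheaper model?

def recommend(w: Workload) -> str:
    """Walk the ladder cheapest rung first; only climb when the cheaper rung fails."""
    if w.prompt_fits_behavior:
        return "prompting"
    if w.needs_fresh_or_private_data:
        return "RAG"
    if w.has_labeled_examples and w.latency_or_cost_bound:
        return "distillation"
    if w.has_labeled_examples:
        return "fine-tuning"
    return "prompting"  # no training data: iterate on prompts first

print(recommend(Workload(False, True, False, False)))  # → RAG
```

The design point the ladder encodes: each rung is cheaper and more reversible than the next, so you only pay for training when prompting and retrieval demonstrably fall short.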
Privacy
Inference Stack
KV Cache
Latency
Cloud Pricing
4GB VRAM
Voice
NVFP4
Hosting
RTX 3090
Llama 3.3 70B
Qwen 3 6
MTP
Distributed
Cost Disaster
Jetson
Runtimes
Coding Agents
Quantization
RAG
Qwen 3
The /q/ set grows when a question trends on Reddit, HN, or in our inbox. If you've seen a thread that deserves a landing page, open a GitHub issue with the question + the thread link.