RUNLOCALAI

Operator-grade instrument for local-AI hardware intelligence. Hand-written verdicts. Real benchmarks. Reproducible commands.


Voice · Week build-out

Local voice assistant pipeline

Real-time speech-to-LLM-to-speech. faster-whisper for transcription, Llama 3.1 8B / Qwen 2.5 7B for the brain, Piper TTS for synthesis. Runs on a single 4090 or even an RTX 3060 with the smaller model. Adds VAD, wake-word, and a low-latency websocket.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,700 words

Build summary

Hardware footprint: RTX 3060 12 GB minimum · 32 GB RAM · USB or I²S microphone
Concurrency: 1-3 concurrent rooms (one wake-word per microphone)
Power: ~150-250 W idle-to-active

Goal: A privacy-respecting voice assistant — the Alexa replacement that doesn't ship your audio anywhere.

Operator card

Best for
  • ✓ Smart-home enthusiasts replacing Alexa/Google Assistant
  • ✓ Privacy-conscious households with kids
  • ✓ Accessibility deployments (eyes-free home control)
  • ✓ Anyone who already runs Home Assistant
Avoid if
  • ⚠ You don't have a microphone setup (placement matters more than the rest)
  • ⚠ You need multilingual real-time (Whisper handles it, but Piper voices are language-pinned)
  • ⚠ You're not willing to tune wake-word thresholds
Stability: stable
Maintenance: weekly attention
Skill: intermediate
Long-session reliability: reliable

Service ledger

6 services across 3 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
  • Qwen 2.5 7B Instruct · model. Brain LLM. Strong tool-calling at the 7B size class. Fits 8 GB cards; leaves headroom for Whisper + Piper on the same GPU. Runs: Ollama / llama.cpp.
  • Ollama · inference · 11434/tcp (loopback). Inference engine. One-line setup; OpenAI-compatible endpoint; the right tool for solo voice deployments. vLLM is overkill here. Runs: host service.
Surface
  • Wyoming protocol gateway · frontend · 10300/tcp. Voice gateway. Home Assistant's Wyoming protocol is the open standard for STT/TTS. Compatible with Home Assistant Voice and ESPHome devices. Runs: Docker container.
  • openWakeWord · router/orchestrator. Wake-word detection. Open-source wake-word engine whose training pipeline needs only a few thousand samples. Runs on the device (Pi, ESP32) or on the workstation. Runs: ESP32 / Pi / workstation.
Pipelines
  • faster-whisper (large-v3-turbo) · speech-to-text · 9000/tcp (WebSocket). CTranslate2-accelerated; 4-8× faster than vanilla Whisper at the same accuracy. The turbo variant adds another ~2× speedup at minor quality cost. Runs: Docker container, GPU 0 shared.
  • Piper · text-to-speech. ONNX-based TTS with high-quality English voices in <10 MB models. CPU-fast; sub-200 ms latency for short replies. Coqui XTTS is the higher-quality alternative. Runs: host service / Docker, CPU.
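The three containerized services above are commonly wired together with Docker Compose. A minimal sketch, assuming the rhasspy-published Wyoming images (`rhasspy/wyoming-whisper`, `rhasspy/wyoming-piper`); image names, flags, voices, and internal ports are assumptions to verify against those projects' documentation, not values stated on this page (Ollama stays a host service on 11434 as in the ledger):

```yaml
# Sketch only: images, flags, and internal ports are assumptions;
# check the rhasspy/wyoming-* docs before deploying.
services:
  whisper:
    image: rhasspy/wyoming-whisper
    command: --model large-v3-turbo --language en
    ports:
      - "9000:10300"            # host port 9000, per the ledger above
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia    # share GPU 0 with the LLM
              count: 1
              capabilities: [gpu]
  piper:
    image: rhasspy/wyoming-piper
    command: --voice en_US-lessac-medium   # example voice name
    ports:
      - "10200:10200"           # CPU-only; no GPU reservation needed
```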

Hardware

RTX 3060 12 GB is the realistic floor. Whisper large-v3-turbo (1.5 GB) + Qwen 2.5 7B Q5_K_M (~5 GB) + Piper (CPU) leaves 5+ GB GPU headroom — comfortable.
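The VRAM arithmetic above can be sanity-checked with a quick budget helper; the component sizes are the paragraph's estimates (weights only, KV cache and CUDA overhead extra), not measurements:

```python
def vram_headroom_gb(total_gb, components):
    """Subtract per-component VRAM estimates from the card's total."""
    return total_gb - sum(components.values())

# Estimates from the paragraph above; Piper runs on CPU so it costs no VRAM.
budget = {
    "whisper_large_v3_turbo": 1.5,   # CTranslate2 weights
    "qwen2.5_7b_q5_k_m": 5.0,        # GGUF weights
}
print(vram_headroom_gb(12.0, budget))   # RTX 3060 12 GB -> 5.5 GB headroom
```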

For a single-room solo setup: a Pi 4 or ESP32-S3 dev board handles wake-word + audio capture + Wyoming relay. Mic placement matters more than mic price; an off-axis mic 3 m from the speaker drops STT accuracy noticeably.

If you want more rooms, run wake-word at the edge (one Pi per room) and centralize Whisper/LLM/TTS. Whisper handles small batches (1-3 streams) on a 3060.

Storage

Trivial. The biggest artifacts are model weights (~7 GB total) and a few hundred MB of Whisper cache. No persistent index unless you add memory (see /workflows/private-chatgpt-replacement).

If you log conversations: each minute of 16 kHz mono 16-bit audio is ~2 MB; transcripts are ~5 KB per minute. Solo use generates under 1 GB/month. Encrypt at rest if logging.
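The storage figures follow from uncompressed PCM size; a small helper (illustrative numbers only) for budgeting log growth:

```python
def audio_mb_per_minute(sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Uncompressed PCM size for one minute of audio, in MB."""
    return sample_rate_hz * bytes_per_sample * channels * 60 / 1e6

def monthly_log_gb(minutes_per_day, transcript_kb_per_min=5, days=30):
    """Audio + transcript log volume per month, in GB."""
    audio_mb = audio_mb_per_minute() * minutes_per_day * days
    text_mb = transcript_kb_per_min * minutes_per_day * days / 1000
    return (audio_mb + text_mb) / 1000

print(audio_mb_per_minute())          # 1.92 MB/min at 16 kHz mono 16-bit
print(round(monthly_log_gb(15), 2))   # 15 min/day logged -> ~0.87 GB/month
```

The ~2 MB/min figure in the text is this 1.92 MB rounded up; even heavy solo use stays well under 1 GB/month.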

Networking

Run everything on the LAN. Wake-word devices reach the workstation over plain HTTP/WebSocket on a private VLAN. Latency budget is <800ms end-to-end (wake → STT → LLM → TTS); each LAN hop adds ~1-3ms which is fine.

Do NOT expose Wyoming or Whisper publicly — they have no auth. If you want remote access (talk to your assistant from outside the home), do it via Home Assistant Cloud or a Tailscale-routed Home Assistant.

Observability

Latency is the only metric users feel. Track:

  • STT latency (audio-stop → transcript-ready). Sustained >800ms means Whisper is fighting the LLM for GPU.
  • First-token latency (transcript → first LLM token). Should stay under 300ms on a 3060.
  • TTS latency (LLM done → first audio chunk). Piper holds well under 100ms.
  • Wake-word false-positive rate. Tune the threshold per device; a 1/hour false-positive is annoying, 1/minute is unusable.

Grafana dashboard with the four latency metrics + GPU utilization is enough.
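The thresholds in the bullets imply a simple per-stage budget check; a minimal sketch (stage names are hypothetical, budgets taken from the bullets above):

```python
# Per-stage latency budgets in ms, from the observability bullets above.
BUDGET_MS = {"stt": 800, "first_token": 300, "tts": 100}

def over_budget(measured_ms):
    """Return the stages whose measured latency exceeds its budget."""
    return [stage for stage, ms in measured_ms.items()
            if ms > BUDGET_MS.get(stage, float("inf"))]

sample = {"stt": 620, "first_token": 340, "tts": 80}
print(over_budget(sample))   # ['first_token'] -> LLM is the slow stage here
```

Feeding these three numbers per request into Grafana (plus GPU utilization) covers everything users can actually feel.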

Security

Audio retention. Default to NOT logging audio. Logging transcripts is fine for debugging; logging raw waveforms is invasive.

Wake-word data. If you train a custom wake-word, the training data is private — don't share the dataset.

Network exposure. Voice assistants get curious — never expose them past a private LAN / tailnet.

Tool-calling. If the LLM has tool access (lights, locks, thermostats), gate destructive actions behind explicit confirmation. "Are you sure you want to unlock the front door?" is non-negotiable.
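The confirmation gate can be sketched as a thin wrapper in front of tool dispatch; the tool names here are hypothetical examples, not a Home Assistant API:

```python
# Hypothetical tool names; the real allow/deny list depends on your setup.
DESTRUCTIVE_TOOLS = {"unlock_door", "disarm_alarm", "open_garage"}

def gate_tool_call(tool_name, confirmed=False):
    """Run safe tools immediately; require explicit confirmation otherwise."""
    if tool_name in DESTRUCTIVE_TOOLS and not confirmed:
        return ("confirm", f"Are you sure you want to run {tool_name}?")
    return ("run", tool_name)

print(gate_tool_call("set_light_brightness"))         # runs immediately
print(gate_tool_call("unlock_door"))                  # asks for confirmation
print(gate_tool_call("unlock_door", confirmed=True))  # runs after confirmation
```

The key design point: confirmation state lives in the gate, not in the LLM, so a hallucinated "user already confirmed" claim from the model can never bypass it.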

Upgrade path

More accuracy: swap large-v3-turbo → large-v3 (slower, more accurate on accents). Or stack a domain-tuned ASR for medical/legal vocabularies.

More natural voice: swap Piper → Coqui XTTS or Bark. Costs latency; gain expressiveness.

Multi-user: add per-speaker recognition via WhisperX speaker diarization. The assistant learns who's talking and can route to per-user memory.

Smarter brain: step up from Qwen 2.5 7B to Qwen 2.5 14B if you have a 4090. Reasoning depth jumps; latency stays acceptable.

What breaks first

  1. GPU contention. Whisper + LLM both want the GPU; concurrent VAD events queue. Either dedicate a GPU to STT or accept ~200ms of queueing in busy moments.
  2. Wake-word drift. Acoustic environment changes (new furniture, ceiling fan) shift false-positive rate. Plan to retrain every 6-12 months.
  3. Piper voice freshness. Piper voices are static; over time they sound dated. Re-train or swap when this matters.
  4. Home Assistant integration breakage. HA core releases occasionally bump Wyoming protocol versions; pin add-on versions.

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

/stacks/local-coding-agent →

Workflow validation

unvalidated

Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.

  • Unvalidated
    qwen-2.5-7b-instruct via ollama
    • · No public benchmarks yet. The workflow's claim about this model is currently unsubstantiated by measurements.
    0 benchmarks · Submit the first benchmark →