Local voice assistant pipeline
Real-time speech-to-LLM-to-speech. faster-whisper for transcription, Llama 3.1 8B / Qwen 2.5 7B for the brain, Piper TTS for synthesis. Runs on a single 4090 or even an RTX 3060 with the smaller model. Adds VAD, wake-word detection, and a low-latency WebSocket transport.
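A minimal sketch of the core loop under those picks, assuming faster-whisper and requests are installed, Ollama serves qwen2.5:7b on its default local port, and the piper CLI plus a voice model are on PATH. The exact Piper invocation, model names, and file paths here are illustrative, not prescriptive:

```python
# Minimal STT -> LLM -> TTS loop. Assumes faster-whisper and requests are
# installed, Ollama is serving qwen2.5:7b on localhost:11434, and the `piper`
# CLI plus a voice model are available. Names and paths are illustrative.
import subprocess
import requests
from faster_whisper import WhisperModel

stt = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")

def transcribe(wav_path: str) -> str:
    segments, _info = stt.transcribe(wav_path, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments)

def ask_llm(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5:7b", "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def speak(text: str, out_path: str = "reply.wav") -> None:
    # Piper reads text on stdin and writes a wav file (flags per its README).
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", out_path],
        input=text.encode(), check=True,
    )

if __name__ == "__main__":
    speak(ask_llm(transcribe("utterance.wav")))
```

The real pipeline streams audio over the WebSocket and overlaps the stages; this sketch only shows the three hand-offs.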
Build summary
Goal: A privacy-respecting voice assistant — the Alexa replacement that doesn't ship your audio anywhere.
Operator card
- ✓ Smart-home enthusiasts replacing Alexa/Google Assistant
- ✓ Privacy-conscious households with kids
- ✓ Accessibility deployments (eyes-free home control)
- ✓ Anyone who already runs Home Assistant
- ⚠ You don't have a microphone setup (placement matters more than anything else in the stack)
- ⚠ You need multilingual real-time (Whisper handles it, but Piper voices are language-pinned)
- ⚠ You're not willing to tune wake-word thresholds
Service ledger
6 services across 3 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.
Hardware
RTX 3060 12 GB is the realistic floor. Whisper large-v3-turbo (1.5 GB) + Qwen 2.5 7B Q5_K_M (~5 GB) + Piper (CPU) leaves 5+ GB GPU headroom — comfortable.
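Before loading the stack it's worth confirming that headroom is actually free; a quick sketch, assuming PyTorch is installed on the same box (the budget numbers just mirror the prose above):

```python
# Check free VRAM before co-hosting Whisper + the LLM on the same card.
# Budget mirrors the prose: ~1.5 GB STT + ~5 GB LLM weights.
import torch

REQUIRED_GIB = 1.5 + 5.0                     # Whisper large-v3-turbo + Qwen 2.5 7B Q5_K_M
free_b, total_b = torch.cuda.mem_get_info()  # bytes free/total on the current CUDA device
free_gib = free_b / 1024**3

print(f"free {free_gib:.1f} GiB of {total_b / 1024**3:.1f} GiB")
if free_gib < REQUIRED_GIB + 1.0:            # keep ~1 GiB slack for KV cache / activations
    raise SystemExit("Not enough VRAM headroom to co-host STT and the LLM on this GPU")
```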
For a single-room solo setup: a Pi 4 or ESP32-S3 dev board handles wake-word + audio capture + Wyoming relay. Mic placement matters more than mic price; an off-axis mic 3 m from the speaker drops STT accuracy noticeably.
If you want to run more rooms, scale wake-word detection at the edge (one Pi per room) and centralize Whisper/LLM/TTS. Whisper only handles small batches (1-3 concurrent streams) on a 3060.
Storage
Trivial. The biggest artifacts are model weights (~7 GB total) and a few hundred MB of Whisper cache. No persistent index unless you add memory (see /workflows/private-chatgpt-replacement).
If you log conversations: each minute of 16 kHz 16-bit mono audio is ~2 MB; transcripts are ~5 KB. Solo use generates < 1 GB/month. Encrypt at rest if logging.
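The per-minute figure is just the sample format doing arithmetic; a back-of-envelope check, assuming 16-bit PCM and an illustrative 20 interactions per day at ~15 s each:

```python
# 16 kHz, 16-bit, mono PCM: bytes per minute of raw audio.
bytes_per_min = 16_000 * 2 * 60              # sample_rate * bytes_per_sample * seconds
print(bytes_per_min / 1_000_000)             # ~1.92 MB, matching the ~2 MB/min figure

# A solo household at ~20 interactions/day, ~15 s of audio each (assumed numbers):
monthly_mb = 20 * 15 / 60 * bytes_per_min / 1_000_000 * 30
print(round(monthly_mb))                     # ~288 MB/month of audio, well under 1 GB
```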
Networking
Run everything on the LAN. Wake-word devices reach the workstation over plain HTTP/WebSocket on a private VLAN. The latency budget is <800 ms end-to-end (wake → STT → LLM → TTS); each LAN hop adds ~1-3 ms, which is negligible.
Do NOT expose Wyoming or Whisper publicly — they have no auth. If you want remote access (talk to your assistant from outside the home), do it via Home Assistant Cloud or a Tailscale-routed Home Assistant.
Observability
Latency is the only metric users feel. Track:
- STT latency (audio-stop → transcript-ready). Sustained >800 ms means Whisper is fighting the LLM for the GPU.
- First-token latency (transcript → first LLM token). Should stay under 300 ms on a 3060.
- TTS latency (LLM done → first audio chunk). Piper stays well under 100 ms.
- Wake-word false-positive rate. Tune the threshold per device; one false positive per hour is annoying, one per minute is unusable.
A Grafana dashboard with the four latency metrics plus GPU utilization is enough; a minimal exporter sketch follows.
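A sketch of that exporter, assuming the prometheus_client package and that the pipeline calls these helpers at each stage boundary; the metric names and port are illustrative:

```python
# Expose the four latency metrics on :9200 for Prometheus/Grafana to scrape.
from prometheus_client import Counter, Histogram, start_http_server

STT_LATENCY    = Histogram("va_stt_seconds", "Audio-stop to transcript-ready")
FIRST_TOKEN    = Histogram("va_first_token_seconds", "Transcript to first LLM token")
TTS_LATENCY    = Histogram("va_tts_seconds", "LLM done to first audio chunk")
WAKE_FALSE_POS = Counter("va_wake_false_positives_total", "Wake events with no speech")

start_http_server(9200)

# In the pipeline, wrap each stage boundary, e.g.:
#   with STT_LATENCY.time():
#       text = transcribe(wav_path)
#   WAKE_FALSE_POS.inc()   # when VAD finds no speech after a wake event
```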
Security
Audio retention. Default to NOT logging audio. Logging transcripts is fine for debugging; logging raw waveforms is invasive.
Wake-word data. If you train a custom wake-word, the training data is private — don't share the dataset.
Network exposure. Voice assistants attract the curious; never expose them past a private LAN / tailnet.
Tool-calling. If the LLM has tool access (lights, locks, thermostats), gate destructive actions behind explicit confirmation. "Are you sure you want to unlock the front door?" is non-negotiable.
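A sketch of that gate, with hypothetical ask_and_listen / execute / speak hooks stubbed out so it runs standalone; the destructive-action list and confirmation words are illustrative:

```python
# Gate destructive tool calls behind an explicit spoken confirmation.
# ask_and_listen / execute / speak are hypothetical hooks into the rest of
# the pipeline; console stubs stand in for them so the sketch runs on its own.
DESTRUCTIVE = {"unlock", "disarm", "open_garage"}
CONFIRM_WORDS = {"yes", "confirm", "go ahead"}

def speak(text: str) -> None:
    print(f"[tts] {text}")                       # stand-in for the Piper call

def ask_and_listen(prompt: str) -> str:
    speak(prompt)
    return input("> ")                           # stand-in for listen + STT

def execute(action: str, target: str) -> None:
    print(f"[tool] {action} -> {target}")        # stand-in for the Home Assistant call

def run_tool_call(action: str, target: str) -> None:
    if action in DESTRUCTIVE:
        reply = ask_and_listen(f"Are you sure you want to {action} the {target}?")
        if reply.strip().lower() not in CONFIRM_WORDS:
            speak("Okay, cancelled.")
            return
    execute(action, target)

run_tool_call("unlock", "front door")            # always asks before unlocking
```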
Upgrade path
More accuracy: swap large-v3-turbo → large-v3 (slower, more accurate on accents). Or stack a domain-tuned ASR for medical/legal vocabularies.
More natural voice: swap Piper → Coqui XTTS or Bark. Costs latency; gains expressiveness.
Multi-user: add per-speaker recognition via WhisperX speaker diarization. Lets the assistant know who's talking and route to per-user memory.
Smarter brain: swap Qwen 2.5 7B → Qwen 2.5 14B if you have a 4090. Reasoning depth jumps; latency stays acceptable (with Ollama the swap is a one-line change; see the sketch below).
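If the brain is served through Ollama, the upgrade really is a single model-tag change in the chat call; a sketch using Ollama's standard HTTP chat endpoint, with the usual Ollama model tags:

```python
# Same request path, bigger brain: only the model tag changes.
import requests

MODEL = "qwen2.5:14b"        # was "qwen2.5:7b"; the 14B quant wants a 4090-class card

def chat(prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": MODEL,
              "messages": [{"role": "user", "content": prompt}],
              "stream": False},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["message"]["content"]

print(chat("Turn on the hallway light at 30 percent."))
```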
What breaks first
- GPU contention. Whisper + the LLM both want the GPU; concurrent VAD events queue. Either dedicate a GPU to STT or accept ~200 ms of queueing at busy moments (a minimal serialization sketch follows this list).
- Wake-word drift. Acoustic environment changes (new furniture, a ceiling fan) shift the false-positive rate. Plan to retrain every 6-12 months.
- Piper voice freshness. Piper voices are static; over time they sound dated. Retrain or swap when this matters.
- Home Assistant integration breakage. HA core releases occasionally bump Wyoming protocol versions; pin add-on versions.
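A minimal way to accept that queueing rather than fight it: serialize GPU-bound stages behind one async lock. The blocking functions below are stand-ins for the real faster-whisper and LLM calls:

```python
# Serialize GPU-bound stages behind one lock so concurrent wake events queue
# instead of colliding. The blocking functions are stand-ins for the real
# faster-whisper and LLM calls.
import asyncio
import time

GPU_LOCK = asyncio.Lock()

def transcribe_blocking(wav_path: str) -> str:
    time.sleep(0.4)                          # stand-in for Whisper on the GPU
    return f"transcript of {wav_path}"

def generate_blocking(text: str) -> str:
    time.sleep(0.6)                          # stand-in for the LLM forward pass
    return f"reply to: {text}"

async def handle_utterance(wav_path: str) -> str:
    # Take the lock per stage, so a queued transcription can slot in between
    # another request's STT and LLM phases instead of waiting for both.
    async with GPU_LOCK:
        text = await asyncio.to_thread(transcribe_blocking, wav_path)
    async with GPU_LOCK:
        return await asyncio.to_thread(generate_blocking, text)

async def main() -> None:
    print(await asyncio.gather(
        handle_utterance("kitchen.wav"), handle_utterance("office.wav")))

asyncio.run(main())
```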
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-7b-instruct via ollama. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements.