Local voice assistant pipeline
Real-time speech-to-LLM-to-speech. faster-whisper for transcription, Llama 3.1 8B / Qwen 2.5 7B for the brain, Piper TTS for synthesis. Runs on a single 4090 or even an RTX 3060 with the smaller model. Adds VAD, wake-word detection, and a low-latency WebSocket transport.
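A minimal sketch of the core loop under those picks, assuming faster-whisper and requests are installed, Ollama serves qwen2.5:7b on its default local port, and the piper CLI plus a voice model are on PATH. The exact Piper invocation, model names, and file paths here are illustrative, not prescriptive:

```python
# Minimal STT -> LLM -> TTS loop. Assumes faster-whisper and requests are
# installed, Ollama is serving qwen2.5:7b on localhost:11434, and the `piper`
# CLI plus a voice model are available. Names and paths are illustrative.
import subprocess
import requests
from faster_whisper import WhisperModel

stt = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8_float16")

def transcribe(wav_path: str) -> str:
    segments, _info = stt.transcribe(wav_path, vad_filter=True)
    return " ".join(seg.text.strip() for seg in segments)

def ask_llm(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "qwen2.5:7b", "prompt": prompt, "stream": False},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["response"]

def speak(text: str, out_path: str = "reply.wav") -> None:
    # Piper reads text on stdin and writes a wav file (flags per its README).
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", out_path],
        input=text.encode(), check=True,
    )

if __name__ == "__main__":
    speak(ask_llm(transcribe("utterance.wav")))
```

The real pipeline streams audio over the WebSocket and overlaps the stages; this sketch only shows the three hand-offs.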
Build summary
Goal: A privacy-respecting voice assistant — the Alexa replacement that doesn't ship your audio anywhere.
Operator card
- ✓ Smart-home enthusiasts replacing Alexa/Google Assistant
- ✓ Privacy-conscious households with kids
- ✓ Accessibility deployments (eyes-free home control)
- ✓ Anyone who already runs Home Assistant
- ⚠ You don't have a microphone setup (placement matters more than anything else in the stack)
- ⚠ You need multilingual real-time (Whisper handles it, but Piper voices are language-pinned)
- ⚠ You're not willing to tune wake-word thresholds
Service ledger
6 services across 3 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.
Hardware
RTX 3060 12 GB is the realistic floor. Whisper large-v3-turbo (1.5 GB) + Qwen 2.5 7B Q5_K_M (~5 GB) + Piper (CPU) leaves 5+ GB GPU headroom — comfortable.
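Before loading the stack it's worth confirming that headroom is actually free; a quick sketch, assuming PyTorch is installed on the same box (the budget numbers just mirror the prose above):

```python
# Check free VRAM before co-hosting Whisper + the LLM on the same card.
# Budget mirrors the prose: ~1.5 GB STT + ~5 GB LLM weights.
import torch

REQUIRED_GIB = 1.5 + 5.0                     # Whisper large-v3-turbo + Qwen 2.5 7B Q5_K_M
free_b, total_b = torch.cuda.mem_get_info()  # bytes free/total on the current CUDA device
free_gib = free_b / 1024**3

print(f"free {free_gib:.1f} GiB of {total_b / 1024**3:.1f} GiB")
if free_gib < REQUIRED_GIB + 1.0:            # keep ~1 GiB slack for KV cache / activations
    raise SystemExit("Not enough VRAM headroom to co-host STT and the LLM on this GPU")
```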
For a single-room solo setup: a Pi 4 or ESP32-S3 dev board handles wake-word + audio capture + Wyoming relay. Mic placement matters more than mic price; an off-axis mic 3 m from the speaker drops STT accuracy noticeably.
If you want to run more rooms, scale wake-word detection at the edge (one Pi per room) and centralize Whisper/LLM/TTS. Whisper only handles small batches (1-3 concurrent streams) on a 3060.
Storage
Trivial. The biggest artifacts are model weights (~7 GB total) and a few hundred MB of Whisper cache. No persistent index unless you add memory (see /workflows/private-chatgpt-replacement).
If you log conversations: each minute of 16 kHz 16-bit mono audio is ~2 MB; transcripts are ~5 KB. Solo use generates < 1 GB/month. Encrypt at rest if logging.
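The per-minute figure is just the sample format doing arithmetic; a back-of-envelope check, assuming 16-bit PCM and an illustrative 20 interactions per day at ~15 s each:

```python
# 16 kHz, 16-bit, mono PCM: bytes per minute of raw audio.
bytes_per_min = 16_000 * 2 * 60              # sample_rate * bytes_per_sample * seconds
print(bytes_per_min / 1_000_000)             # ~1.92 MB, matching the ~2 MB/min figure

# A solo household at ~20 interactions/day, ~15 s of audio each (assumed numbers):
monthly_mb = 20 * 15 / 60 * bytes_per_min / 1_000_000 * 30
print(round(monthly_mb))                     # ~288 MB/month of audio, well under 1 GB
```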
Networking
Run everything on the LAN. Wake-word devices reach the workstation over plain HTTP/WebSocket on a private VLAN. The latency budget is <800 ms end-to-end (wake → STT → LLM → TTS); each LAN hop adds ~1-3 ms, which is negligible.
Do NOT expose Wyoming or Whisper publicly — they have no auth. If you want remote access (talk to your assistant from outside the home), do it via Home Assistant Cloud or a Tailscale-routed Home Assistant.
Observability
Latency is the only metric users feel. Track:
- STT latency (audio-stop → transcript-ready). Sustained >800 ms means Whisper is fighting the LLM for the GPU.
- First-token latency (transcript → first LLM token). Should stay under 300 ms on a 3060.
- TTS latency (LLM done → first audio chunk). Piper stays well under 100 ms.
- Wake-word false-positive rate. Tune the threshold per device; one false positive per hour is annoying, one per minute is unusable.
A Grafana dashboard with the four latency metrics plus GPU utilization is enough; a minimal exporter sketch follows.
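A sketch of that exporter, assuming the prometheus_client package and that the pipeline calls these helpers at each stage boundary; the metric names and port are illustrative:

```python
# Expose the four latency metrics on :9200 for Prometheus/Grafana to scrape.
from prometheus_client import Counter, Histogram, start_http_server

STT_LATENCY    = Histogram("va_stt_seconds", "Audio-stop to transcript-ready")
FIRST_TOKEN    = Histogram("va_first_token_seconds", "Transcript to first LLM token")
TTS_LATENCY    = Histogram("va_tts_seconds", "LLM done to first audio chunk")
WAKE_FALSE_POS = Counter("va_wake_false_positives_total", "Wake events with no speech")

start_http_server(9200)

# In the pipeline, wrap each stage boundary, e.g.:
#   with STT_LATENCY.time():
#       text = transcribe(wav_path)
#   WAKE_FALSE_POS.inc()   # when VAD finds no speech after a wake event
```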
Security
Audio retention. Default to NOT logging audio. Logging transcripts is fine for debugging; logging raw waveforms is invasive.
Wake-word data. If you train a custom wake-word, the training data is private — don't share the dataset.
Network exposure. Voice assistants attract the curious; never expose them past a private LAN / tailnet.
Tool-calling. If the LLM has tool access (lights, locks, thermostats), gate destructive actions behind explicit confirmation. "Are you sure you want to unlock the front door?" is non-negotiable.
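A sketch of that gate, with hypothetical ask_and_listen / execute / speak hooks stubbed out so it runs standalone; the destructive-action list and confirmation words are illustrative:

```python
# Gate destructive tool calls behind an explicit spoken confirmation.
# ask_and_listen / execute / speak are hypothetical hooks into the rest of
# the pipeline; console stubs stand in for them so the sketch runs on its own.
DESTRUCTIVE = {"unlock", "disarm", "open_garage"}
CONFIRM_WORDS = {"yes", "confirm", "go ahead"}

def speak(text: str) -> None:
    print(f"[tts] {text}")                       # stand-in for the Piper call

def ask_and_listen(prompt: str) -> str:
    speak(prompt)
    return input("> ")                           # stand-in for listen + STT

def execute(action: str, target: str) -> None:
    print(f"[tool] {action} -> {target}")        # stand-in for the Home Assistant call

def run_tool_call(action: str, target: str) -> None:
    if action in DESTRUCTIVE:
        reply = ask_and_listen(f"Are you sure you want to {action} the {target}?")
        if reply.strip().lower() not in CONFIRM_WORDS:
            speak("Okay, cancelled.")
            return
    execute(action, target)

run_tool_call("unlock", "front door")            # always asks before unlocking
```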
Upgrade path
More accuracy: swap large-v3-turbo → large-v3 (slower, more accurate on accents). Or stack a domain-tuned ASR for medical/legal vocabularies.
More natural voice: swap Piper → Coqui XTTS or Bark. Costs latency; gains expressiveness.
Multi-user: add per-speaker recognition via WhisperX speaker diarization. Lets the assistant know who's talking and route to per-user memory.
Smarter brain: swap Qwen 2.5 7B → Qwen 2.5 14B if you have a 4090. Reasoning depth jumps; latency stays acceptable (with Ollama the swap is a one-line change; see the sketch below).
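If the brain is served through Ollama, the upgrade really is a single model-tag change in the chat call; a sketch using Ollama's standard HTTP chat endpoint, with the usual Ollama model tags:

```python
# Same request path, bigger brain: only the model tag changes.
import requests

MODEL = "qwen2.5:14b"        # was "qwen2.5:7b"; the 14B quant wants a 4090-class card

def chat(prompt: str) -> str:
    r = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": MODEL,
              "messages": [{"role": "user", "content": prompt}],
              "stream": False},
        timeout=120,
    )
    r.raise_for_status()
    return r.json()["message"]["content"]

print(chat("Turn on the hallway light at 30 percent."))
```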
What breaks first
- GPU contention. Whisper + the LLM both want the GPU; concurrent VAD events queue. Either dedicate a GPU to STT or accept ~200 ms of queueing at busy moments (a minimal serialization sketch follows this list).
- Wake-word drift. Acoustic environment changes (new furniture, a ceiling fan) shift the false-positive rate. Plan to retrain every 6-12 months.
- Piper voice freshness. Piper voices are static; over time they sound dated. Retrain or swap when this matters.
- Home Assistant integration breakage. HA core releases occasionally bump Wyoming protocol versions; pin add-on versions.
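A minimal way to accept that queueing rather than fight it: serialize GPU-bound stages behind one async lock. The blocking functions below are stand-ins for the real faster-whisper and LLM calls:

```python
# Serialize GPU-bound stages behind one lock so concurrent wake events queue
# instead of colliding. The blocking functions are stand-ins for the real
# faster-whisper and LLM calls.
import asyncio
import time

GPU_LOCK = asyncio.Lock()

def transcribe_blocking(wav_path: str) -> str:
    time.sleep(0.4)                          # stand-in for Whisper on the GPU
    return f"transcript of {wav_path}"

def generate_blocking(text: str) -> str:
    time.sleep(0.6)                          # stand-in for the LLM forward pass
    return f"reply to: {text}"

async def handle_utterance(wav_path: str) -> str:
    # Take the lock per stage, so a queued transcription can slot in between
    # another request's STT and LLM phases instead of waiting for both.
    async with GPU_LOCK:
        text = await asyncio.to_thread(transcribe_blocking, wav_path)
    async with GPU_LOCK:
        return await asyncio.to_thread(generate_blocking, text)

async def main() -> None:
    print(await asyncio.gather(
        handle_utterance("kitchen.wav"), handle_utterance("office.wav")))

asyncio.run(main())
```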
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Workflow validation
Each row is a (model × hardware × runtime) triple this workflow claims. Validation is rule-based: 0 validated by reproduced benchmarks, 0 supported by single-source benchmarks, 0 supported by same-family hardware, 0 supported by adjacent-hardware measurements, 1 currently unvalidated. We never fabricate validation; if no benchmark exists, we say so.
- Unvalidated: qwen-2.5-7b-instruct via ollama. No public benchmarks yet; the workflow's claim about this model is currently unsubstantiated by measurements.