by OpenAI
OpenAI's open-weight speech-to-text family. Whisper Large v3 and Whisper Turbo are the canonical open-weight STT models; faster-whisper and WhisperX (which builds on it) deliver production-grade speed via the CTranslate2 backend.
Start with Whisper large-v3 via faster-whisper on an RTX 3060 12 GB. Whisper large-v3 is the reference open-weight speech-to-text model, supporting 99 languages with best-in-class English transcription accuracy (roughly 2-3% WER on LibriSpeech test-clean, ~8% on harder multilingual benchmarks). faster-whisper's CTranslate2 backend achieves ~50× real-time transcription on an RTX 3060 12 GB, so a 1-hour audio file transcribes in ~70 seconds. For lower VRAM (<6 GB), use Whisper medium (769M params): ~20× real-time, WER ~10%, acceptable for non-production use. For on-device mobile, Whisper tiny (39M params) runs on a smartphone CPU at ~5× real-time via whisper.cpp. Skip Whisper large-v3-turbo for local deployment: cutting the decoder from 32 layers to 4 saves only ~15% VRAM in practice for ~5% WER degradation, so large-v3 is the better quality-per-VRAM tradeoff. MIT license.
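A minimal sketch of the recommended setup using faster-whisper's documented Python API; the model name and compute type match the recommendation above, while the file name and beam size are placeholders:

```python
# Minimal faster-whisper transcription sketch; "audio.mp3" is a placeholder.
from faster_whisper import WhisperModel

# device="cuda" + compute_type="float16" targets an RTX 3060 12 GB;
# compute_type="int8" cuts VRAM further at a small accuracy cost.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:  # segments is a generator; decoding happens as you iterate
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```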
- For single-user transcription: faster-whisper with Whisper large-v3 (FP16) on an RTX 3060 12 GB, ~50× real-time; the CTranslate2 INT8 variant cuts VRAM further. Install with pip install faster-whisper, then call it from Python as in the sketch above (faster-whisper is a library, not a CLI; the community whisper-ctranslate2 package wraps it for command-line use).
- For CPU-only: whisper.cpp with Whisper large-v3 Q5_0 on an Apple M3, ~8× real-time via Core ML, ~4 GB RAM, excellent for offline transcription.
- For multi-user serving: faster-whisper behind a REST API with a GPU worker pool (sketch below); one L4 24 GB handles ~15 concurrent transcription streams.
- For mobile: whisper.cpp with Whisper tiny/medium on Snapdragon X Elite via ARM NEON, ~5× real-time on-device.
- For real-time streaming: whisper.cpp's streaming mode with VAD (voice activity detection) transcribes chunks as they arrive with ~2-second latency.
See the GPU buyer guide.
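For the multi-user scenario, one common pattern is a thin HTTP wrapper around a shared model instance. The FastAPI sketch below is hypothetical (faster-whisper ships no server of its own); it is single-worker and omits the GPU worker pool a real deployment would add:

```python
# Hypothetical FastAPI wrapper around faster-whisper, not an official server.
# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
import tempfile

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()
# One model instance shared across requests; a production setup would put a
# GPU worker pool behind a queue instead of calling the model inline.
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Spool the upload to disk; faster-whisper accepts a path or file-like object.
    with tempfile.NamedTemporaryFile(suffix=".audio") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        # Note: transcribe() is synchronous and blocks the event loop here;
        # offload it to a thread or process pool for real concurrency.
        segments, info = model.transcribe(tmp.name)
        text = "".join(s.text for s in segments)
    return {"language": info.language, "duration": info.duration, "text": text}
```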
Verify Whisper runs at the speed you need on your specific hardware before committing money.
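A quick smoke test along those lines, measuring the real-time factor you actually get with faster-whisper; "sample.mp3" is a placeholder for any local test clip:

```python
# Hardware smoke test: how many times faster than real time does large-v3
# transcribe on *your* machine?
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

start = time.perf_counter()
segments, info = model.transcribe("sample.mp3")
text = "".join(s.text for s in segments)  # generator: consuming it runs the decode
elapsed = time.perf_counter() - start

# info.duration is the audio length in seconds reported by faster-whisper.
print(f"audio {info.duration:.1f}s / wall {elapsed:.1f}s "
      f"= {info.duration / elapsed:.1f}x real-time")
```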