Identifying who-spoke-when in multi-speaker audio. PyAnnote is the open-weight default.
Install PyAnnote, the standard open-weight speaker diarization library, then load the pipeline:

pip install pyannote.audio

from pyannote.audio import Pipeline
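# speaker-diarization-3.1 is gated: accept its user conditions on Hugging Face, then pass your token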
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token="YOUR_HF_TOKEN")
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f"Speaker {speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
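The pipeline returns a pyannote.core.Annotation; it can be saved in the standard RTTM format for downstream tools (the file name here is illustrative):

with open("meeting.rttm", "w") as f:
    diarization.write_rttm(f)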
WhisperX (pip install whisperx) integrates both steps in one command:

whisperx meeting.wav --model large-v3 --diarize --hf_token YOUR_TOKEN

PyAnnote diarization runs entirely on CPU at 20-50× real-time: any $300 laptop diarizes a 1-hour meeting in 2-3 minutes. The full WhisperX pipeline adds the STT step, which costs another 3-5 minutes per hour on CPU (Whisper large-v3 is slow without a GPU). For GPU-accelerated STT, a used GTX 1060 6 GB (~$60) drops WhisperX to 3-5× real-time, so a 1-hour meeting transcribes and diarizes in ~15 minutes. Total build: ~$360. Diarization is CPU-friendly; it is the STT stage that benefits from a GPU.
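The same pipeline is scriptable from Python. A minimal sketch following the WhisperX README; batch_size and compute_type are tuning assumptions (use compute_type="int8" on CPU), and exact module paths can move between WhisperX releases:

import whisperx

device = "cuda"  # or "cpu"
model = whisperx.load_model("large-v3", device, compute_type="float16")
audio = whisperx.load_audio("meeting.wav")

# 1. transcribe, 2. word-align, 3. diarize, 4. attach speaker labels
result = model.transcribe(audio, batch_size=16)
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result["segments"][0])  # each segment now carries a "speaker" key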
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) handles the full WhisperX + diarization pipeline. A 1-hour meeting transcribes + diarizes in ~5-8 minutes (Whisper large-v3 at 15-20× real-time + diarization at 20-50×). For meeting-intelligence platforms processing 100+ hours/day: batch pipeline on multiple GPUs. Total build: ~$700-900. For enterprise meeting transcription (Zoom/Teams integration), the compute is light — budget for audio storage and the integration layer. Diarization accuracy plateaus at ~90-95% regardless of GPU — model improvements, not hardware, are the bottleneck.
The mistake: Running diarization on a meeting recording, getting "Speaker 1, Speaker 2, Speaker 3" labels, and presenting it as "automated meeting notes with speaker identification."

Why it fails: PyAnnote identifies speech segments and clusters them by voice similarity. It labels them SPEAKER_00, SPEAKER_01, not "Alice, Bob, Charlie." The diarizer doesn't know names. If you present generic labels as identification, you haven't identified anyone.

The fix: Add a speaker enrollment step. Record 30 seconds of each participant speaking (or extract it from previous meetings) and enroll these voice prints. When diarizing, compare each speaker cluster to the enrolled voice prints to map SPEAKER_00 to "Alice" (a sketch follows below). Without enrollment you have speaker separation (who spoke when), not speaker identification (who is who). Separation is useful; identification requires enrollment.
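A minimal enrollment sketch using pyannote's pretrained embedding model. The enrollment file names, the 2-second minimum turn length, and the 0.5 cosine-distance threshold are illustrative assumptions; calibrate the threshold on your own recordings:

import numpy as np
from scipy.spatial.distance import cdist
from pyannote.audio import Model, Inference

# pretrained speaker-embedding model (gated on Hugging Face, like the pipeline)
model = Model.from_pretrained("pyannote/embedding", use_auth_token="YOUR_HF_TOKEN")
inference = Inference(model, window="whole")

# enrollment: one short, clean clip per known participant (hypothetical files)
enrolled = {name: np.atleast_2d(inference(f"{name}.wav"))
            for name in ("alice", "bob", "charlie")}

# map each diarized cluster to the closest enrolled voice print;
# `diarization` is the pipeline output from the first example
names = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    if speaker in names or turn.duration < 2.0:
        continue  # one representative turn per cluster; skip very short turns
    emb = np.atleast_2d(inference.crop("meeting.wav", turn))
    dists = {n: cdist(emb, e, metric="cosine")[0, 0] for n, e in enrolled.items()}
    best = min(dists, key=dists.get)
    # 0.5 is an illustrative cutoff, not a published default; unmatched
    # clusters keep their generic SPEAKER_XX label
    names[speaker] = best if dists[best] < 0.5 else speaker

print(names)  # e.g. {"SPEAKER_00": "alice", ...}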
Browse all tools for runtimes that fit this workload.
Audio models are surprisingly forgiving on hardware. Whisper, Coqui, and whisper.cpp all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.
The errors operators most often hit when running speaker diarization locally. Each links to a diagnose-and-fix walkthrough.
Verify your specific hardware can handle speaker diarization before committing money.