Identifying who-spoke-when in multi-speaker audio. PyAnnote is the open-weight default.
Install PyAnnote, the standard open-weight speaker diarization library, then load the pipeline:

pip install pyannote.audio

from pyannote.audio import Pipeline
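# speaker-diarization-3.1 is gated: accept its user conditions on Hugging Face, then pass your token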
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token="YOUR_HF_TOKEN")
diarization = pipeline("meeting.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f"Speaker {speaker}: {turn.start:.1f}s - {turn.end:.1f}s")
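The pipeline returns a pyannote.core.Annotation; it can be saved in the standard RTTM format for downstream tools (the file name here is illustrative):

with open("meeting.rttm", "w") as f:
    diarization.write_rttm(f)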
WhisperX (pip install whisperx) integrates both steps in one command:

whisperx meeting.wav --model large-v3 --diarize --hf_token YOUR_TOKEN

PyAnnote diarization runs entirely on CPU at 20-50× real-time: any $300 laptop diarizes a 1-hour meeting in 2-3 minutes. The full WhisperX pipeline adds the STT step, which costs another 3-5 minutes per hour on CPU (Whisper large-v3 is slow without a GPU). For GPU-accelerated STT, a used GTX 1060 6 GB (~$60) drops WhisperX to 3-5× real-time, so a 1-hour meeting transcribes and diarizes in ~15 minutes. Total build: ~$360. Diarization is CPU-friendly; it is the STT stage that benefits from a GPU.
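The same pipeline is scriptable from Python. A minimal sketch following the WhisperX README; batch_size and compute_type are tuning assumptions (use compute_type="int8" on CPU), and exact module paths can move between WhisperX releases:

import whisperx

device = "cuda"  # or "cpu"
model = whisperx.load_model("large-v3", device, compute_type="float16")
audio = whisperx.load_audio("meeting.wav")

# 1. transcribe, 2. word-align, 3. diarize, 4. attach speaker labels
result = model.transcribe(audio, batch_size=16)
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result["segments"][0])  # each segment now carries a "speaker" key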
Used RTX 3060 12 GB (~$200-250, see /hardware/rtx-3060-12gb) handles the full WhisperX + diarization pipeline. A 1-hour meeting transcribes + diarizes in ~5-8 minutes (Whisper large-v3 at 15-20× real-time + diarization at 20-50×). For meeting-intelligence platforms processing 100+ hours/day: batch pipeline on multiple GPUs. Total build: ~$700-900. For enterprise meeting transcription (Zoom/Teams integration), the compute is light — budget for audio storage and the integration layer. Diarization accuracy plateaus at ~90-95% regardless of GPU — model improvements, not hardware, are the bottleneck.
The mistake: Running diarization on a meeting recording, getting "Speaker 1, Speaker 2, Speaker 3" labels, and presenting it as "automated meeting notes with speaker identification."

Why it fails: PyAnnote identifies speech segments and clusters them by voice similarity. It labels them SPEAKER_00, SPEAKER_01, not "Alice, Bob, Charlie." The diarizer doesn't know names. If you present generic labels as identification, you haven't identified anyone.

The fix: Add a speaker enrollment step. Record 30 seconds of each participant speaking (or extract it from previous meetings) and enroll these voice prints. When diarizing, compare each speaker cluster to the enrolled voice prints to map SPEAKER_00 to "Alice" (a sketch follows below). Without enrollment you have speaker separation (who spoke when), not speaker identification (who is who). Separation is useful; identification requires enrollment.
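A minimal enrollment sketch using pyannote's pretrained embedding model. The enrollment file names, the 2-second minimum turn length, and the 0.5 cosine-distance threshold are illustrative assumptions; calibrate the threshold on your own recordings:

import numpy as np
from scipy.spatial.distance import cdist
from pyannote.audio import Model, Inference

# pretrained speaker-embedding model (gated on Hugging Face, like the pipeline)
model = Model.from_pretrained("pyannote/embedding", use_auth_token="YOUR_HF_TOKEN")
inference = Inference(model, window="whole")

# enrollment: one short, clean clip per known participant (hypothetical files)
enrolled = {name: np.atleast_2d(inference(f"{name}.wav"))
            for name in ("alice", "bob", "charlie")}

# map each diarized cluster to the closest enrolled voice print;
# `diarization` is the pipeline output from the first example
names = {}
for turn, _, speaker in diarization.itertracks(yield_label=True):
    if speaker in names or turn.duration < 2.0:
        continue  # one representative turn per cluster; skip very short turns
    emb = np.atleast_2d(inference.crop("meeting.wav", turn))
    dists = {n: cdist(emb, e, metric="cosine")[0, 0] for n, e in enrolled.items()}
    best = min(dists, key=dists.get)
    # 0.5 is an illustrative cutoff, not a published default; unmatched
    # clusters keep their generic SPEAKER_XX label
    names[speaker] = best if dists[best] < 0.5 else speaker

print(names)  # e.g. {"SPEAKER_00": "alice", ...}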
Browse all tools for runtimes that fit this workload.
Audio models are surprisingly forgiving on hardware. Whisper, Coqui, and whisper.cpp all run well on 8-12 GB GPUs. The bottleneck is rarely the GPU; it's audio preprocessing and disk I/O for batch transcription.
The errors operators most often hit when running speaker diarization locally. Each links to a diagnose-and-fix walkthrough.
Verify your specific hardware can handle speaker diarization before committing money.