Best GPU for voice cloning
An honest 2026 guide to GPU hardware for local voice cloning. The requirements are surprisingly light — 8-12 GB works for most workflows, and CPU paths (Piper, Kokoro) are often enough. Here's when a GPU even matters for TTS and voice cloning.
The short answer
Voice cloning is surprisingly light on hardware. Most open-source TTS models (XTTS-v2, F5-TTS, StyleTTS2) run comfortably on 8-12 GB VRAM. You don't need a 24 GB GPU for voice cloning — and in many cases, you don't need a GPU at all.
The used RTX 3060 12 GB at $200-280 is the value sweet spot — 12 GB runs F5-TTS fine-tuning and XTTS-v2 zero-shot cloning comfortably. If you want a warranty plus headroom, the RTX 4060 Ti 16 GB at ~$450 is overkill for TTS but future-proofs you for heavier workloads.
For CPU-only paths: Kokoro TTS and Piper TTS run entirely on CPU with good quality and speed. Many voice cloning pipelines don't need GPU at all — just fast CPU inference with ONNX or GGUF. This is the local-AI workload where GPU spending has the worst ROI.
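As a sketch of what the CPU-only path looks like, here is a minimal wrapper around the Piper CLI, which reads text on stdin and writes a WAV on CPU. The voice filename is an example; download voices from the Piper repository first:

```python
import subprocess

def synthesize_cpu(text: str, voice: str = "en_US-lessac-medium.onnx",
                   out_path: str = "hello.wav") -> None:
    """Synthesize speech on CPU via the Piper CLI (no GPU involved).

    Piper reads text on stdin and writes a WAV file; the voice file
    must be downloaded separately from the Piper voices repository.
    """
    subprocess.run(
        ["piper", "--model", voice, "--output_file", out_path],
        input=text.encode(),
        check=True,
    )

# synthesize_cpu("Local voice synthesis without a GPU.")
```

Kokoro exposes a similar CPU-first workflow; the point is that nothing in this pipeline touches CUDA.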
The picks, ranked by buyer-leverage
12 GB · $200-280 (used RTX 3060 12 GB, 2026)
12 GB runs XTTS-v2 zero-shot + F5-TTS fine-tuning comfortably. The best $/work-done card in voice cloning.
Good for:
- XTTS-v2 voice cloning (zero-shot, fine-tune)
- F5-TTS single-voice generation
- Buyers who want the cheapest GPU that covers voice cloning
Not for:
- Multi-voice concurrent generation (needs 16 GB+)
- Buyers who want a warranty (buy the 4060 Ti 16 GB new)
- Operators who also run LLMs on the same card
16 GB · $450-550 (RTX 4060 Ti 16 GB, 2026 retail)
16 GB is overkill for voice cloning alone — but worth it if you also run LLMs or image gen on the same card.
Good for:
- Voice cloning + LLM inference on the same GPU
- Multi-voice concurrent TTS generation
- Buyers wanting new + warranty + future headroom
Not for:
- Voice-cloning-only operators (12 GB is enough)
- Buyers who will use CPU-only TTS paths
- Tight budgets (a used 3060 12 GB handles it for half the price)
24 GB · $1,399 (Mac mini M4 Pro 24 GB, 2026)
24 GB unified runs voice cloning + LLM colocated. Best always-on TTS server with zero fan noise.
Good for:
- Mac-first voice cloning pipelines (MLX backends)
- Always-on TTS server alongside LLM inference
- Developers who value silent operation
Not for:
- CUDA-optimized TTS pipelines (the XTTS CUDA path is faster)
- Budget-constrained builders (a used 3060 12 GB is ~$200)
- Windows TTS workflows (some tools are Mac-only via MLX)
36 GB · $2,800-3,200 (M4 Max 36 GB MacBook Pro, 2026)
36 GB unified runs voice cloning + LLM + image gen on a laptop. Overkill for TTS, perfect for full-stack local AI.
Good for:
- Mobile voice cloning + a full local-AI stack
- Developers who need TTS + LLM + ComfyUI on one laptop
- Silent always-on personal AI server
Not for:
- Voice-cloning-only users (massively overkill)
- Budget-constrained buyers (a 3060 12 GB desktop is ~$200)
- CUDA-locked workflows
Honesty: why benchmark numbers on this page might not reflect your real experience
- tok/s is not user experience. Humans read at ~10-15 tok/s — anything above that is buffer time, not perceived speed.
- Context length changes everything. A 70B Q4 model at 1024 tokens generates ~25 tok/s; the same model at 32K context drops to ~8-12 tok/s as KV cache fills.
- Quantization changes the conclusion. Q4_K_M vs Q5_K_M vs Q8 produce different speed AND different quality. A benchmark at one quant doesn't translate to another.
- Thermal throttling changes long sessions. The first 15 minutes of a benchmark see boost-clock peak; the next 4 hours see steady-state, which is 5-15% slower depending on case airflow.
- Driver and runtime versions silently shift winners. A 2024 benchmark on PyTorch 2.4 + CUDA 12.4 doesn't reflect 2026 reality on PyTorch 2.6 + CUDA 12.6. Discount benchmarks older than 6 months.
- Vendor and YouTuber benchmarks are cherry-picked. The standard 'Llama 3.1 70B Q4 at 1024 tokens' chart shows peak decode on a tiny prompt — exactly the conditions least representative of daily use.
- Our ranking is by workload fit at the buyer's actual budget — not by raw benchmark order. A faster card that doesn't fit your workload ranks below a slower card that does.
We try to surface these caveats where they apply. If a number on this page reads more confident than it should, please email us via the contact page. See also our methodology and editorial philosophy.
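The context-length point above is mostly KV-cache growth. A rough estimator, assuming Llama-3.1-70B-like dimensions (80 layers, 8 grouped-query KV heads, head dim 128) and an FP16 cache; actual runtimes and quantized caches will differ:

```python
def kv_cache_gb(context_tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache size in GB: K and V each store
    layers * kv_heads * head_dim values per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

print(f"{kv_cache_gb(1024):.2f} GB at 1K context")    # well under 1 GB
print(f"{kv_cache_gb(32768):.2f} GB at 32K context")  # over 10 GB extra to stream
```

That extra cache has to be read every decode step, which is where the drop from ~25 to ~8-12 tok/s comes from.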
How to think about VRAM tiers
Voice cloning is the outlier workload — even 4 GB cards run many TTS models. The VRAM question is less 'what do I need?' and more 'what else do I want to run on this card?'
- 4 GB — Kokoro CPU-only, Piper CPU-only. No dedicated GPU needed. Fine for basic TTS.
- 8 GB — XTTS-v2 zero-shot cloning. F5-TTS single-voice. Reasonable GPU floor for voice cloning.
- 12 GB — F5-TTS fine-tuning + XTTS-v2 fine-tuning comfortable. Multi-voice batches. Voice cloning sweet spot.
- 16 GB+ — Multi-voice concurrent generation + LLM colocated. Overkill for voice cloning alone but high-leverage if GPU serves multiple workloads.
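The tier logic above reduces to a simple lookup. The workload names and thresholds here are this page's rough guidance, not hard limits:

```python
def min_vram_gb(workloads: set[str]) -> int:
    """Rough minimum VRAM (GB) to cover a set of workloads; 0 = CPU is fine."""
    needs = {
        "cpu_tts": 0,           # Piper / Kokoro run on CPU
        "zero_shot_clone": 8,   # XTTS-v2 zero-shot, F5-TTS single voice
        "fine_tune": 12,        # F5-TTS / XTTS-v2 fine-tuning
        "multi_voice_llm": 16,  # concurrent voices + LLM colocation
    }
    return max(needs[w] for w in workloads)

print(min_vram_gb({"cpu_tts"}))                       # 0
print(min_vram_gb({"zero_shot_clone", "fine_tune"}))  # 12
```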
Frequently asked questions
Do I need a GPU for voice cloning?
No. Kokoro TTS and Piper TTS run on CPU with ONNX/GGUF and deliver good quality at 2-5x real-time on modern CPUs. XTTS-v2 also has a CPU path (slower, but overnight batch synthesis works). GPU accelerates XTTS-v2 and F5-TTS 5-10x but is optional, not required.
Can I run voice cloning on an 8 GB GPU?
Yes. XTTS-v2 zero-shot works on 6 GB minimum. F5-TTS single-voice runs on 8 GB. Fine-tuning needs 10-12 GB. 8 GB is a workable voice cloning floor — unlike LLM or image gen where 8 GB is below modern minimum.
Is voice cloning faster on GPU vs CPU?
Yes, significantly. XTTS-v2 zero-shot: GPU (5-10x real-time) vs CPU (0.5-1x real-time). For real-time TTS (streaming voice assistant), GPU is mandatory. For overnight batch synthesis, CPU is fine.
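To make the real-time-factor (RTF) figures concrete, a quick bit of arithmetic (the RTF values are the ballpark numbers quoted above, not measurements):

```python
def synthesis_minutes(audio_seconds: float, rtf: float) -> float:
    """Wall-clock minutes to synthesize audio; rtf > 1 is faster than real time."""
    return audio_seconds / rtf / 60

audiobook_chapter = 3600  # one hour of finished audio
print(synthesis_minutes(audiobook_chapter, 5.0))  # GPU at 5x: 12 minutes
print(synthesis_minutes(audiobook_chapter, 0.5))  # CPU at 0.5x: 120 minutes
```

At 0.5x real-time, an hour of audio takes two hours to render — fine overnight, useless for a streaming assistant.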
What's the best open-source voice cloning model?
F5-TTS leads on quality (natural prosody, speaker adaptation). XTTS-v2 leads on zero-shot cloning + multi-language. StyleTTS2 leads on controllability. All three run comfortably on 12 GB VRAM. Pick based on your cloning type (zero-shot vs fine-tune vs controllable).
Can I run voice cloning + LLM on the same GPU?
Yes, if VRAM allows. A 16 GB card fits a 13B Q4 LLM (~8 GB) + XTTS-v2 (~4 GB) concurrently. For 70B LLM + TTS, you need 24 GB minimum. Most operators run TTS sequentially after LLM text generation to avoid concurrent VRAM contention.
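The colocation check is just addition against a VRAM budget, with headroom for activations and runtime overhead. The 2 GB headroom and the model footprints are this answer's estimates, not measured values:

```python
def fits_concurrently(vram_gb: float, footprints_gb: list[float],
                      headroom_gb: float = 2.0) -> bool:
    """True if the listed models fit on the card at once, with headroom."""
    return sum(footprints_gb) + headroom_gb <= vram_gb

print(fits_concurrently(16, [8, 4]))  # 13B Q4 LLM + XTTS-v2 on 16 GB: True
print(fits_concurrently(12, [8, 4]))  # same pair on 12 GB: False
```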
Why do people overspend on GPUs for voice cloning?
Because they mistake voice cloning's hardware requirements for LLM/vision model requirements. Voice cloning is 10-50x lighter. A $250 used 3060 12 GB handles it — the same card that struggles with 70B LLMs breezes through TTS. Don't buy a 4090 for XTTS-v2.
Go deeper
- Best GPU for Whisper (STT) — Speech-to-text hardware — the other half of the voice pipeline
- Best GPU for local AI — All workloads ranked — TTS is the outlier at low VRAM
- Best budget GPU under $500 — Sub-$500 cards that handle voice + more
- RTX 3060 12 GB full verdict — Deep-dive on the voice cloning value pick
When it doesn't work
Hardware bought, set up correctly, still failing? Start with the highest-volume local-AI errors and their fixes.
Common alternatives readers consider:
- If your budget is tighter → best budget GPU for local AI
- If you'd rather buy used → best used GPU for local AI
- If you're on Apple Silicon → best Mac for local AI
- If you're not sure what fits your build → the will-it-run checker
- If you don't want to buy anything yet → our editorial philosophy