Aphrodite Engine
Overview
vLLM fork specialized for creative writing / role-play workloads. Adds samplers (smoothing factor, dynatemp, mirostat, DRY, XTC) that mainline vLLM doesn't ship. Same continuous-batching architecture; trades some throughput for sampler richness.
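As a sketch of what the extra samplers buy you, the request below layers dynamic temperature, smoothing, XTC, and DRY onto a standard chat completion, using the server and model from the Setup section below. The parameter names (dynatemp_min/dynatemp_max, smoothing_factor, xtc_threshold, xtc_probability, dry_multiplier) are assumptions based on common Aphrodite builds and may differ across versions; check the project's sampler docs before relying on them.

```bash
# Hypothetical sampler-rich request against Aphrodite's OpenAI-compatible API.
# Field names below are assumptions; verify against your Aphrodite version.
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write the opening scene of a heist."}],
    "temperature": 1.0,
    "dynatemp_min": 0.7,
    "dynatemp_max": 1.3,
    "smoothing_factor": 0.2,
    "xtc_threshold": 0.1,
    "xtc_probability": 0.5,
    "dry_multiplier": 0.8
  }'
```

Stock vLLM does not ship these samplers, which is the main reason to pick Aphrodite.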
Setup guidance
Install via pip into a Python 3.10+ venv on a machine with CUDA 12.1 or newer: pip install aphrodite-engine. Aphrodite is a fork of vLLM optimized for single-user throughput rather than multi-tenant serving.

Start the server with aphrodite run meta-llama/Llama-3.1-8B-Instruct --port 2242. It exposes an OpenAI-compatible API at /v1/chat/completions; verify with a curl request against that endpoint (the full command sequence is sketched below).

Aphrodite maintains vLLM's PagedAttention KV-cache management but replaces the continuous-batching scheduler with a single-stream-optimized path. It supports the EXL2, AWQ, GPTQ, and FP8 quantization formats; for GGUF models, point the CLI at a local file with aphrodite run ./model.gguf.

The first run downloads the model from Hugging Face, which takes roughly 5–20 minutes for a 70B model; for the 8B example above, expect about 10 minutes from zero to first response. Aphrodite also includes a SillyTavern-compatible API mode for role-play and chat-UI integrations.
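Condensed into one runnable sequence, the bootstrap looks like this; the model, port, and payload are the examples from this section:

```bash
# Create and activate a Python 3.10+ venv, then install Aphrodite (CUDA 12.1+ assumed).
python3 -m venv .venv && source .venv/bin/activate
pip install aphrodite-engine

# Launch the OpenAI-compatible server; the first run downloads the model weights.
aphrodite run meta-llama/Llama-3.1-8B-Instruct --port 2242

# From a second shell, verify the endpoint answers.
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'

# GGUF models load from a local path instead of a Hugging Face ID.
aphrodite run ./model.gguf
```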
Workload fit
Best for:
- single-user, high-throughput local LLM serving on NVIDIA GPUs
- role-play and creative-writing scenarios where fast single-stream decode improves the interactive experience
- SillyTavern and character-chat frontend integration
- users who want vLLM's PagedAttention memory management without the continuous-batching complexity
- GGUF users who want faster decode than raw llama.cpp CUDA

Not suited for:
- multi-tenant production serving (use vLLM)
- CPU-only or Apple Silicon deployment (a CUDA or ROCm GPU is required; see Compatibility)
- workloads with multiple concurrent users
- users who need automatic model management (use Ollama)
Alternatives
- Use Aphrodite when you want vLLM-level single-user throughput with less operational complexity; it strips the continuous-batching complexity for the single-user case and is the go-to engine for role-play and creative writing.
- Switch to vLLM when you need multi-tenant concurrency; Aphrodite's scheduler is not optimized for concurrent requests.
- Use ExLlamaV2 for maximum single-stream decode speed on consumer NVIDIA GPUs, if you can accept EXL2 format conversion.
- Use Ollama for zero-config desktop LLM serving with automatic model management; Aphrodite requires explicit model specification and a Python environment.
- Use KoboldCPP when you need a bundled chat UI, Windows-native deployment without Python, and the full GGUF ecosystem.

Aphrodite sits between vLLM and ExLlamaV2: more single-user throughput than vLLM, broader model-format support than ExLlamaV2.
Troubleshooting + when to switch
Problem: performance identical to vLLM, no throughput gain. Fix: Aphrodite's single-user optimization engages only at concurrency 1; with multiple concurrent requests it falls back to near-vLLM behavior. Test with single sequential requests, and pass --enforce-eager to bypass CUDA graph capture, which can mask single-user gains.

Problem: GGUF model fails to load. Fix: Aphrodite's GGUF support goes through a llama.cpp integration, so not every GGUF quantization is supported. Stick to the Q4_K_M, Q5_K_M, and Q8_0 formats; below Q4_K_M, Aphrodite may reject the model or produce garbage output.

Problem: SillyTavern connection fails. Fix: Aphrodite's SillyTavern API mode requires the --api-type kobold flag, and the endpoint is /api/v1/generate on port 2242, not the standard OpenAI path. Configure SillyTavern with the "KoboldAI" API type pointing at http://localhost:2242 (see the probe sketch below).
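To take the SillyTavern fix further, you can probe the Kobold-style endpoint directly before wiring up the UI. The request fields ("prompt", "max_length") follow the KoboldAI API convention and are assumptions here, not confirmed Aphrodite documentation:

```bash
# Start Aphrodite in its KoboldAI-compatible API mode (flag from the fix above).
aphrodite run meta-llama/Llama-3.1-8B-Instruct --port 2242 --api-type kobold

# From a second shell, hit the Kobold generate endpoint directly.
# Payload fields are assumptions based on the KoboldAI convention.
curl http://localhost:2242/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_length": 64}'
```

If the probe returns text, point SillyTavern's "KoboldAI" connection at http://localhost:2242.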
Pros
- Sampling-method richness — DRY / XTC / dynatemp don't exist in stock vLLM
- OpenAI-compatible API like vLLM — drop-in for compatible clients
- Strong fit for SillyTavern / TavernAI / role-play workloads
Cons
- Lags vLLM mainline on new model architectures by 2–6 weeks
- Smaller community + fewer production deployments
- Throughput slightly trails vLLM at high concurrency
Compatibility
| Operating systems | Linux, Windows |
| GPU backends | NVIDIA CUDA, AMD ROCm |
| License | Free + open-source |
Runtime health
Operator-grade signals on how actively Aphrodite Engine is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal for this runtime.
5 days since last refresh · source: lastUpdated timestamp
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Frequently asked
Is Aphrodite Engine free?
Yes. It is free, open-source software.

What operating systems does Aphrodite Engine support?
Linux and Windows.

Which GPUs work with Aphrodite Engine?
NVIDIA GPUs via CUDA and AMD GPUs via ROCm.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
Verify Aphrodite Engine runs on your specific hardware before committing money.