Aphrodite Engine
Overview
vLLM fork specialized for creative writing / role-play workloads. Adds samplers (smoothing factor, dynatemp, mirostat, DRY, XTC) that mainline vLLM doesn't ship. Same continuous-batching architecture; trades some throughput for sampler richness.
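As a sketch of what the extra samplers buy you, the request below layers dynamic temperature, smoothing, XTC, and DRY onto a standard chat completion, using the server and model from the Setup section below. The parameter names (dynatemp_min/dynatemp_max, smoothing_factor, xtc_threshold, xtc_probability, dry_multiplier) are assumptions based on common Aphrodite builds and may differ across versions; check the project's sampler docs before relying on them.

```bash
# Hypothetical sampler-rich request against Aphrodite's OpenAI-compatible API.
# Field names below are assumptions; verify against your Aphrodite version.
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Write the opening scene of a heist."}],
    "temperature": 1.0,
    "dynatemp_min": 0.7,
    "dynatemp_max": 1.3,
    "smoothing_factor": 0.2,
    "xtc_threshold": 0.1,
    "xtc_probability": 0.5,
    "dry_multiplier": 0.8
  }'
```

Stock vLLM does not ship these samplers, which is the main reason to pick Aphrodite.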
Setup guidance
Install via pip into a Python 3.10+ venv on a machine with CUDA 12.1 or newer: pip install aphrodite-engine. Aphrodite is a fork of vLLM optimized for single-user throughput rather than multi-tenant serving.

Start the server with aphrodite run meta-llama/Llama-3.1-8B-Instruct --port 2242. It exposes an OpenAI-compatible API at /v1/chat/completions; verify with a curl request against that endpoint (the full command sequence is sketched below).

Aphrodite maintains vLLM's PagedAttention KV-cache management but replaces the continuous-batching scheduler with a single-stream-optimized path. It supports the EXL2, AWQ, GPTQ, and FP8 quantization formats; for GGUF models, point the CLI at a local file with aphrodite run ./model.gguf.

The first run downloads the model from Hugging Face, which takes roughly 5–20 minutes for a 70B model; for the 8B example above, expect about 10 minutes from zero to first response. Aphrodite also includes a SillyTavern-compatible API mode for role-play and chat-UI integrations.
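Condensed into one runnable sequence, the bootstrap looks like this; the model, port, and payload are the examples from this section:

```bash
# Create and activate a Python 3.10+ venv, then install Aphrodite (CUDA 12.1+ assumed).
python3 -m venv .venv && source .venv/bin/activate
pip install aphrodite-engine

# Launch the OpenAI-compatible server; the first run downloads the model weights.
aphrodite run meta-llama/Llama-3.1-8B-Instruct --port 2242

# From a second shell, verify the endpoint answers.
curl http://localhost:2242/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'

# GGUF models load from a local path instead of a Hugging Face ID.
aphrodite run ./model.gguf
```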
Workload fit
Best for:
- single-user, high-throughput local LLM serving on NVIDIA GPUs
- role-play and creative-writing scenarios where fast single-stream decode improves the interactive experience
- SillyTavern and character-chat frontend integration
- users who want vLLM's PagedAttention memory management without the continuous-batching complexity
- GGUF users who want faster decode than raw llama.cpp CUDA

Not suited for:
- multi-tenant production serving (use vLLM)
- CPU-only or Apple Silicon deployment (a CUDA or ROCm GPU is required; see Compatibility)
- workloads with multiple concurrent users
- users who need automatic model management (use Ollama)
Alternatives
- Use Aphrodite when you want vLLM-level single-user throughput with less operational complexity; it strips the continuous-batching complexity for the single-user case and is the go-to engine for role-play and creative writing.
- Switch to vLLM when you need multi-tenant concurrency; Aphrodite's scheduler is not optimized for concurrent requests.
- Use ExLlamaV2 for maximum single-stream decode speed on consumer NVIDIA GPUs, if you can accept EXL2 format conversion.
- Use Ollama for zero-config desktop LLM serving with automatic model management; Aphrodite requires explicit model specification and a Python environment.
- Use KoboldCPP when you need a bundled chat UI, Windows-native deployment without Python, and the full GGUF ecosystem.

Aphrodite sits between vLLM and ExLlamaV2: more single-user throughput than vLLM, broader model-format support than ExLlamaV2.
Troubleshooting + when to switch
Problem: performance identical to vLLM, no throughput gain. Fix: Aphrodite's single-user optimization engages only at concurrency 1; with multiple concurrent requests it falls back to near-vLLM behavior. Test with single sequential requests, and pass --enforce-eager to bypass CUDA graph capture, which can mask single-user gains.

Problem: GGUF model fails to load. Fix: Aphrodite's GGUF support goes through a llama.cpp integration, so not every GGUF quantization is supported. Stick to the Q4_K_M, Q5_K_M, and Q8_0 formats; below Q4_K_M, Aphrodite may reject the model or produce garbage output.

Problem: SillyTavern connection fails. Fix: Aphrodite's SillyTavern API mode requires the --api-type kobold flag, and the endpoint is /api/v1/generate on port 2242, not the standard OpenAI path. Configure SillyTavern with the "KoboldAI" API type pointing at http://localhost:2242 (see the probe sketch below).
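To take the SillyTavern fix further, you can probe the Kobold-style endpoint directly before wiring up the UI. The request fields ("prompt", "max_length") follow the KoboldAI API convention and are assumptions here, not confirmed Aphrodite documentation:

```bash
# Start Aphrodite in its KoboldAI-compatible API mode (flag from the fix above).
aphrodite run meta-llama/Llama-3.1-8B-Instruct --port 2242 --api-type kobold

# From a second shell, hit the Kobold generate endpoint directly.
# Payload fields are assumptions based on the KoboldAI convention.
curl http://localhost:2242/api/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_length": 64}'
```

If the probe returns text, point SillyTavern's "KoboldAI" connection at http://localhost:2242.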
Pros
- Sampling-method richness — DRY / XTC / dynatemp don't exist in stock vLLM
- OpenAI-compatible API like vLLM — drop-in for compatible clients
- Strong fit for SillyTavern / TavernAI / role-play workloads
Cons
- Lags vLLM mainline on new model architectures by 2–6 weeks
- Smaller community + fewer production deployments
- Throughput slightly trails vLLM at high concurrency
Compatibility
| Operating systems | Linux, Windows |
| GPU backends | NVIDIA CUDA, AMD ROCm |
| License | Free + open-source |
Runtime health
Operator-grade signals on how actively Aphrodite Engine is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal for this runtime.
5 days since last refresh · source: lastUpdated timestamp
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Frequently asked
Is Aphrodite Engine free?
Yes. It is free, open-source software.

What operating systems does Aphrodite Engine support?
Linux and Windows.

Which GPUs work with Aphrodite Engine?
NVIDIA GPUs via CUDA and AMD GPUs via ROCm.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
Verify Aphrodite Engine runs on your specific hardware before committing money.