Hugging Face Text Generation Inference (TGI)
Also known as: tgi, huggingface-tgi, hf-tgi
Hugging Face Text Generation Inference (TGI) is a production-grade inference server for large language models, optimized for high throughput and low latency on GPU clusters. It supports continuous batching, tensor parallelism across multiple GPUs, and quantization (bitsandbytes, GPTQ, AWQ). Operators encounter TGI when deploying models via Hugging Face's Inference Endpoints or self-hosting with Docker on multi-GPU rigs. It competes with vLLM and llama.cpp for serving scenarios, but TGI is tightly integrated with the Hugging Face ecosystem (model hub, tokenizers, safetensors).
Deeper dive
TGI is designed for serving LLMs at scale, not for single-user local inference. It uses custom CUDA kernels for Flash Attention and PagedAttention (similar to vLLM) to manage the KV cache efficiently. Key features: continuous batching (requests are added to and removed from the running batch at each decoding step), tensor parallelism (the model is split across GPUs and synchronized via NCCL), and support for popular quantization methods. TGI exposes a REST API compatible with OpenAI's chat completions endpoint, making it a drop-in replacement for OpenAI API calls. It also supports streaming, logprobs, and custom stopping criteria. For local operators, TGI is overkill unless they are running a multi-GPU server; single-GPU users typically prefer vLLM or llama.cpp for lower overhead.
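As a sketch of that OpenAI-compatible interface, the request below targets a TGI instance assumed to be already listening on localhost:8080; the prompt and generation parameters are placeholders, not values from TGI's defaults.

  # Query a local TGI server through its OpenAI-compatible chat completions endpoint
  curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "tgi",
          "messages": [{"role": "user", "content": "Explain continuous batching in one sentence."}],
          "max_tokens": 128,
          "stream": false
        }'

Setting "stream": true instead returns tokens incrementally as server-sent events, which is how the streaming mentioned above is typically consumed.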
Practical example
An operator with a 4x RTX 4090 rig (96 GB total VRAM) runs TGI to serve Llama 3.1 70B quantized to 4-bit AWQ (≈40 GB of weights). With tensor parallelism across the 4 GPUs, each GPU holds ~10 GB of weights. TGI's continuous batching lets 10 concurrent users each see ~30 tok/s, versus an effective ~5 tok/s per user when requests are served one at a time. The operator deploys via Docker: docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-generation-inference:2.0 --model-id meta-llama/Meta-Llama-3.1-70B --quantize awq.
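A fuller launch command for this setup might look like the sketch below. The cache volume path and shared-memory size are illustrative, --num-shard 4 enables the tensor parallelism described above, and --quantize awq assumes the checkpoint referenced by --model-id has already been quantized with AWQ (TGI does not convert full-precision weights to AWQ on the fly).

  # Illustrative TGI launch for a 4-GPU host; paths and sizes are placeholders
  docker run --gpus all --shm-size 1g \
    -v $PWD/tgi-data:/data \
    -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id meta-llama/Meta-Llama-3.1-70B \
    --quantize awq \
    --num-shard 4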
Workflow example
In a production workflow, an operator first pulls a model from the Hugging Face Hub using huggingface-cli download meta-llama/Meta-Llama-3.1-70B, then launches TGI with --model-id pointing at the local cache. Clients send POST requests to http://localhost:8080/v1/chat/completions with OpenAI-style payloads. The operator monitors GPU utilization with nvidia-smi and adjusts --max-batch-prefill-tokens to avoid out-of-memory errors during prefill. For scaling, they add --num-shard 4 to enable tensor parallelism. TGI's logs report request latency and batch sizes.
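Put together, the workflow might look like the following sketch; the local directory, mounted path, and prefill-token value are illustrative rather than prescribed.

  # 1) Pre-download the weights (target directory is illustrative)
  huggingface-cli download meta-llama/Meta-Llama-3.1-70B --local-dir ./tgi-data/Meta-Llama-3.1-70B

  # 2) Launch TGI as in the Docker example above, with --model-id /data/Meta-Llama-3.1-70B
  #    (the path inside the mounted volume), --num-shard 4 for tensor parallelism, and a
  #    conservative --max-batch-prefill-tokens if prefill spikes cause OOM

  # 3) Send an OpenAI-style request (see the curl example above) and watch GPU memory
  watch -n 1 nvidia-smi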