HuggingFace 429 rate limit — authenticate or wait
HuggingFace returns HTTP 429 when you exceed the anonymous rate limit. A free account + token raises your ceiling dramatically. Here's exactly what triggers it, how to authenticate, and how to batch downloads so you never hit it again.
Diagnostic order — most likely first
Anonymous downloads exceeding HuggingFace's rate ceiling
You haven't logged in (no `huggingface-cli login`, no `HF_TOKEN` env var). You're downloading multiple large model files concurrently. After 10-20 GB of anonymous downloads in a short window, requests start returning HTTP 429.
Create a free account at huggingface.co. Generate a READ token at huggingface.co/settings/tokens. Run `huggingface-cli login` and paste the token, or `export HF_TOKEN=hf_xxxxx`. Authenticated downloads get orders-of-magnitude higher rate limits and aren't throttled at the anonymous tier.
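As a concrete example, here's a minimal Python sketch of the authenticated path, assuming `huggingface_hub` is installed and `HF_TOKEN` is set in the environment; the repo id `gpt2` is just a stand-in for whatever model you're pulling:

```python
import os

from huggingface_hub import login, snapshot_download

# Validate and cache the token from the environment (set by
# `huggingface-cli login` or an explicit `export HF_TOKEN=...`).
login(token=os.environ["HF_TOKEN"])

# Downloads now count against your authenticated quota, not the
# anonymous tier. "gpt2" is a placeholder repo id.
path = snapshot_download("gpt2")
print(path)
```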
Download accelerator with aggressive parallelism triggering 429 at the CDN level
Using `huggingface-cli download --resume-download` with high concurrency, or `aria2c -x 16`, or a Python script with `threading.Thread` × N pulling files concurrently. After a few minutes, 429s appear even with `HF_TOKEN` set — the CDN layer (Cloudflare) is throttling, not HuggingFace's app layer.
Lower concurrency to 4-8 parallel connections. `huggingface-cli download` already caps concurrent file transfers at a modest default (tune it with `--max-workers` if needed). For aria2c, use `-x 4` instead of `-x 16`, and add `-j 4` to cap how many files download simultaneously (`-s` controls per-file splitting, not file count). In Python, use a `ThreadPoolExecutor(max_workers=4)` instead of spawning threads by hand, as in the sketch below.
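A minimal sketch of the pooled approach, assuming `huggingface_hub` is installed; the repo id and file names are placeholders for your own model:

```python
from concurrent.futures import ThreadPoolExecutor

from huggingface_hub import hf_hub_download

REPO_ID = "gpt2"                              # placeholder repo id
FILES = ["config.json", "model.safetensors"]  # placeholder file names

def fetch(filename: str) -> str:
    # hf_hub_download reuses the local cache and resumes partial files
    return hf_hub_download(repo_id=REPO_ID, filename=filename)

# Cap at 4 workers so the CDN sees a modest, steady request rate
with ThreadPoolExecutor(max_workers=4) as pool:
    for path in pool.map(fetch, FILES):
        print(path)
```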
Retry loop without backoff hammering the endpoint
Your download script caught the 429, but the retry logic fires immediately, with no backoff (or with sub-second delays). Each instant retry makes the rate limiter more aggressive: one 429 begets another, and eventually a longer ban.
Implement exponential backoff with jitter. In Python: `time.sleep(min(60, 2**attempt + random.uniform(0, 1)))`. In bash: `sleep $((2**attempt + RANDOM % 10))`. `huggingface-cli download` has its own retry-and-resume handling (`--resume-download` on older versions; recent releases resume by default), so if you're wrapping it in a script, don't stack your own retry loop on top. A fuller Python version is sketched below.
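For a script that talks to the endpoint directly, here's a hedged sketch using `requests` (the URL is whatever file you're fetching); it also honors the `Retry-After` header that often accompanies a 429:

```python
import random
import time

import requests

def get_with_backoff(url: str, max_attempts: int = 6) -> requests.Response:
    for attempt in range(max_attempts):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()  # surface non-rate-limit errors
            return resp
        # Prefer the server's Retry-After hint when present (assumes the
        # delta-seconds form); otherwise back off exponentially with
        # jitter, capped at 60 seconds.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else min(
            60, 2**attempt + random.uniform(0, 1)
        )
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```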
Corporate network / VPN sharing IP with other heavy downloaders
You're on a corporate VPN or shared office network. The rate limiter sees the shared public IP, not your machine. If 10 coworkers are also pulling models from HuggingFace on the same IP, the combined traffic triggers the CDN rate limit.
Switch to a different network if possible. If stuck on the shared IP: stagger download times outside business hours, use `HF_TOKEN` (authenticated traffic is pooled at the user level, not the IP level at the HF application layer), and reduce concurrency. The CDN layer may still throttle the shared IP — if persistent, use the community mirror: `export HF_ENDPOINT=https://hf-mirror.com`.
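If you go the mirror route programmatically, note that `huggingface_hub` reads `HF_ENDPOINT` once at import time, so set it before the import (or export it in your shell); a minimal sketch:

```python
import os

# Must be set before huggingface_hub is imported; the library reads
# HF_ENDPOINT when its config module loads.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

path = snapshot_download("gpt2")  # placeholder repo id, served via the mirror
print(path)
```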
Frequently asked questions
How long does a HuggingFace rate limit ban last?
The application-level anonymous rate limit typically cools within 15-30 minutes. The CDN-level throttle (Cloudflare) can last 1-4 hours depending on severity. With a valid HF_TOKEN, the application-level limit effectively disappears for normal use. The CDN limit can still trigger if you're making hundreds of requests per minute.
Does logging in with a free account actually raise the limit?
Yes, dramatically. Anonymous traffic is capped at roughly 1-2 GB/hour in practice. Authenticated traffic (free account) has no hard cap for normal download patterns: you can pull 100+ GB without hitting a limit as long as you avoid aggressive parallelism. The token ties your traffic to an account, so it's rate-limited per user instead of being lumped in with anonymous bot traffic.
Can I cache models locally to avoid re-downloading and hitting rate limits?
Yes — HuggingFace caches everything by default. Downloaded models live in `~/.cache/huggingface/hub/` (override with `HF_HOME`). The cache is content-addressed by blob SHA256, so if you download the same model from different scripts, it's served from cache. Set up a shared cache directory on a NAS or shared drive for team workflows: `export HF_HOME=/mnt/shared/hf-cache`. This is the single best defense against rate limits for repeat workflows.
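To see what's already cached before kicking off a big download, `huggingface_hub` ships a cache scanner; a minimal sketch (it respects the `HF_HOME` override):

```python
from huggingface_hub import scan_cache_dir

info = scan_cache_dir()  # walks the hub cache, honoring HF_HOME
print(f"{info.size_on_disk / 1e9:.1f} GB across {len(info.repos)} cached repos")
for repo in sorted(info.repos, key=lambda r: r.size_on_disk, reverse=True):
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")
```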
Is there a way to download models without touching HuggingFace at all?
For some models, yes. The Ollama registry (registry.ollama.ai) mirrors popular models on a different CDN. ModelScope (modelscope.cn) hosts a large collection aimed primarily at a Chinese audience but is globally accessible. CivitAI hosts diffusion models. But HuggingFace has the broadest catalog; for most models it's the canonical source. If you're rate-limited, the mirror (`hf-mirror.com`) is the next-best path.
What's the difference between a READ token and a WRITE token for HuggingFace?
READ token: can download public and gated models (once you've accepted the license in the browser). Can't push or modify repos. This is what you want for inference and downloading. WRITE token: can push, modify, and manage repos; only needed if you're uploading models or datasets. (HuggingFace also offers fine-grained tokens that scope permissions to specific repos.) Always use a READ token for download workflows: it minimizes damage if the token leaks.
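For example, a READ token passed explicitly is all a gated download needs; a sketch, where the repo id stands in for a gated model you've already been granted access to:

```python
import os

from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="meta-llama/Llama-2-7b-hf",  # hypothetical gated repo
    filename="config.json",
    token=os.environ["HF_TOKEN"],        # READ scope is sufficient
)
print(path)
```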
Related troubleshooting
HuggingFace download errors fall into three buckets: auth (gated model, no token), rate limiting (anonymous traffic capped), and network (corporate proxy, country block). Diagnose by HTTP status code, then fix per cause.
Ollama 'model not found' errors trace to typos in the model name, pulling a model that doesn't exist in the official registry, network blocks on the registry, or pulling from a custom registry without auth.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: