Editorially reviewed May 2026

HuggingFace download failed — fix auth, rate limits, network

HuggingFace download errors split into auth (gated model, no token), rate-limit (anonymous traffic capped), or network (corporate proxy, country block). Diagnose by HTTP status code, fix per cause.

Applies to: huggingface_hub (Python) · huggingface-cli · diffusers · transformers · any HF-pulling tool
By Fredoline Eruo · Last verified 2026-05-08
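The status-code triage described above can be sketched as a small helper. This is an illustrative mapping, not part of any HF library; the function name and messages are invented for this sketch:

```python
def diagnose_hf_error(status_code: int) -> str:
    """Map an HTTP status from huggingface.co to the likely cause."""
    causes = {
        401: "auth: missing or invalid token (gated repo?)",
        403: "auth: token valid but license not accepted for this repo",
        429: "rate limit: too many requests (authenticate or slow down)",
        404: "repo or file does not exist (check the repo id and filename)",
    }
    # Anything else that reaches you as a connection failure is usually
    # a network problem (proxy, DNS, country block) rather than HF itself.
    return causes.get(status_code, "network or server issue: retry, check proxy")

print(diagnose_hf_error(401))
```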

Diagnostic order — most likely first

#1

Gated model needs accepted license + auth token

Diagnose

Error: `401 Client Error: Unauthorized for url: https://huggingface.co/...` or 'Cannot access gated repo.' Llama, Gemma, and some Flux variants are gated.

Fix

Visit the model's HuggingFace page in a browser and accept the license (form on the page). Generate a token at huggingface.co/settings/tokens, then run `huggingface-cli login` or `export HF_TOKEN=hf_xxxxx` before downloading.
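Under the hood, the token travels as a standard Bearer header on every request. A minimal sketch of what that looks like, reading the same `HF_TOKEN` variable the CLI uses (the token value below is a placeholder):

```python
import os

def hf_auth_headers() -> dict:
    """Build the Authorization header that authenticated HF requests carry."""
    token = os.environ.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; run `huggingface-cli login` first")
    return {"Authorization": f"Bearer {token}"}

os.environ["HF_TOKEN"] = "hf_xxxxx"  # placeholder token for illustration
print(hf_auth_headers())
```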

#2

Rate limit on anonymous downloads

Diagnose

Error: `429 Too Many Requests` or 'Rate limit exceeded.' Anonymous traffic is capped; serious downloaders need a free account + token.

Fix

Create free HF account. Generate token. Use it: `HF_TOKEN=hf_xxxxx`. Authenticated rate limits are dramatically higher.
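If you must stay anonymous for a short while, back off on 429s instead of hammering. A minimal sketch of exponential backoff; the `fetch` callable and delay values are illustrative stand-ins for your actual request:

```python
import time

def download_with_backoff(fetch, max_retries=5, base_delay=2.0):
    """Call `fetch()`, retrying with exponential backoff while it returns 429."""
    for attempt in range(max_retries):
        status, body = fetch()
        if status != 429:
            return body
        time.sleep(base_delay * (2 ** attempt))  # 2s, 4s, 8s, ...
    raise RuntimeError("still rate-limited after retries; authenticate instead")

# Simulated fetch: rate-limited twice, then succeeds.
attempts = iter([(429, None), (429, None), (200, b"model bytes")])
print(download_with_backoff(lambda: next(attempts), base_delay=0.01))
```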

#3

Network block / corporate proxy

Diagnose

Connection times out or fails with DNS error. Other sites work fine. `curl https://huggingface.co` hangs.

Fix

Configure proxy: `export HTTPS_PROXY=http://proxy.host:port`. Or use HF mirror: `export HF_ENDPOINT=https://hf-mirror.com` (community mirror). Verify with `curl $HF_ENDPOINT`.
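Python tools built on `requests`/`urllib` pick up `HTTPS_PROXY` automatically; you can also wire a proxy explicitly with the standard library. The proxy host and port below are placeholders:

```python
import os
import urllib.request

# Equivalent of `export HTTPS_PROXY=...`, scoped to the current process.
os.environ["HTTPS_PROXY"] = "http://proxy.host:port"

# Or build an opener with an explicit proxy, bypassing env vars entirely.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"https": "http://proxy.host:port"})
)
# opener.open("https://huggingface.co")  # would route through the proxy
```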

#4

Disk full mid-download

Diagnose

Download stops partway (e.g. at 60%), often with `OSError: [Errno 28] No space left on device`. Re-running fails immediately on the partial cache.

Fix

Free disk. Clear partial downloads: `rm -rf ~/.cache/huggingface/hub/<broken-model>`. Resume with `huggingface-cli download` (resume-friendly) instead of `git clone`.
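The hub cache layout is predictable, so you can target exactly the broken repo rather than wiping everything. A sketch of how the folder name is derived, assuming the default cache location (the repo id is an example):

```python
from pathlib import Path

def cached_repo_path(repo_id: str) -> Path:
    """Return the default hub-cache folder for a repo id like 'org/repo'."""
    return (
        Path.home() / ".cache" / "huggingface" / "hub"
        / f"models--{repo_id.replace('/', '--')}"
    )

broken = cached_repo_path("meta-llama/Llama-3-8B")
print(broken)  # ~/.cache/huggingface/hub/models--meta-llama--Llama-3-8B
# shutil.rmtree(broken)  # uncomment to actually clear the partial download
```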

#5

Slow connection causing timeouts

Diagnose

Download speed drops to KB/s and the transfer eventually fails. Large repos (a 70B model is roughly 140 GB at fp16, 40+ GB quantized) are sensitive to flaky connections.

Fix

Use `huggingface-cli download <repo>` — recent versions resume interrupted downloads by default (older versions need the now-deprecated `--resume-download` flag). Or install `hf_transfer` (Rust-based, faster): `pip install hf-transfer && export HF_HUB_ENABLE_HF_TRANSFER=1`. Then re-run.
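Resuming works because the client requests only the missing bytes via an HTTP `Range` header. The core idea, sketched against a local byte source (the `fetch_range` callable stands in for the actual HTTP request):

```python
def resume_download(partial: bytearray, total_size: int, fetch_range):
    """Fill in the missing tail of `partial`, the way a Range request does."""
    offset = len(partial)          # bytes we already have on disk
    if offset >= total_size:
        return bytes(partial)      # nothing left to fetch
    # A real client would send: Range: bytes=<offset>-
    partial.extend(fetch_range(offset))
    return bytes(partial)

blob = b"0123456789"                       # pretend this is the remote file
have = bytearray(blob[:6])                 # download died at 60%
done = resume_download(have, len(blob), lambda off: blob[off:])
print(done == blob)  # True
```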

#6

429 rate limit from repeated anonymous downloads (HF's anti-abuse throttle)

Diagnose

Error 429 on every request even after waiting. HuggingFace's rate limiter tracks IP + user-agent patterns. If you've been hammering anonymous downloads (especially with parallel downloaders — aria2c at high concurrency, or `huggingface-cli download` with many workers), your IP is temporarily throttled.

Fix

Create a free HuggingFace account, generate a token at huggingface.co/settings/tokens (read scope is enough for downloads), and authenticate: `huggingface-cli login` or `export HF_TOKEN=hf_xxxxx`. Authenticated users get dramatically higher rate limits. Also lower the concurrency of your downloader — 4-8 parallel connections max, not 16.
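Capping concurrency is straightforward with a bounded thread pool. A sketch with simulated per-file downloads, using 4 workers to match the guidance above (`fetch_one` would be an HTTP GET in real use):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(files, fetch_one, max_workers=4):
    """Download files with at most `max_workers` requests in flight at once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, files))

# Simulated shard names; a 70B repo typically ships as many such files.
shards = [f"model-{i:05d}.safetensors" for i in range(8)]
results = download_all(shards, lambda name: f"fetched {name}", max_workers=4)
print(len(results))  # 8
```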

#7

Your IP range has been temporarily rate-limited at the CDN level (not just HF's application layer)

Diagnose

429 errors persist even with a valid HF_TOKEN. The CDN layer (Cloudflare in front of cdn-lfs.huggingface.co) applies its own rate limiting independent of HuggingFace's application-level auth. This is rarer but hits users on shared IPs (corporate VPNs, university networks, some ISPs).

Fix

Switch networks: try your phone hotspot, a different VPN exit node, or wait 30-60 minutes for the CDN throttle to cool. You can also use the HF mirror: `export HF_ENDPOINT=https://hf-mirror.com` (community-maintained mirror that routes through different CDN). Verify with `curl -I $HF_ENDPOINT/<org>/<repo>/resolve/main/README.md`.
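Endpoint switching works because file URLs follow a fixed `resolve` pattern, so only the host changes. A helper that builds the same URL the verification `curl` hits, honoring `HF_ENDPOINT` (the repo id and filename are placeholders):

```python
import os

def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the download URL for one file, honoring HF_ENDPOINT if set."""
    endpoint = os.environ.get("HF_ENDPOINT", "https://huggingface.co").rstrip("/")
    return f"{endpoint}/{repo_id}/resolve/{revision}/{filename}"

os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"
print(resolve_url("org/repo", "README.md"))
# https://hf-mirror.com/org/repo/resolve/main/README.md
```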

Frequently asked questions

Do I need a HuggingFace account to download models?

Not strictly — most public models download anonymously. But gated models (Llama, Gemma, etc.) require an account plus an accepted license, and rate-limited heavy downloads benefit from a token. A free account + token is the practical default.

What's the fastest way to download a large HF repo?

Install `hf_transfer` (Rust-based, parallelized): `pip install hf-transfer` then `HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download <repo>`. Often 5-10x faster than the default Python downloader on fast connections.

Where does HuggingFace cache models on my machine?

Default: `~/.cache/huggingface/hub/`. Override with `HF_HOME=/path/to/cache`. Each model is stored under `models--<org>--<repo>/`. To free space, delete specific model folders or use `huggingface-cli delete-cache`.

How many parallel downloads can I run before HuggingFace rate-limits me?

Authenticated: 4-8 concurrent connections are generally safe. 16+ parallel connections from the same IP (especially with a download accelerator like aria2c at `-x 16`) will trigger rate limiting within minutes. Anonymous: don't go above 2-4 concurrent connections and spread downloads over time. The rate limiter tracks request frequency, not bandwidth consumed.

Can I use a download mirror to bypass HuggingFace's rate limits entirely?

Yes. The community-maintained mirror at hf-mirror.com mirrors the LFS files on a different CDN. Set `export HF_ENDPOINT=https://hf-mirror.com` before any `huggingface-cli` command. Note: mirrors typically lag the official source by hours to a day, so brand-new model uploads may not be mirrored immediately.

Why is my HF token being rejected even though I just created it?

Three common causes: (1) The token was created with 'read' scope but the model is gated behind a license acceptance — visit the model page in a browser, accept the license, then try again. (2) You're using a 'fine-grained' token scoped to a specific repo and you're accessing a different repo. (3) The token was generated but the `HF_TOKEN` env var has a trailing space or newline — check the raw bytes with `printf '%s' "$HF_TOKEN" | xxd | head` (plain `echo` appends its own newline, which masks the problem).
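Cause (3) is easy to check programmatically as well. A small sanity check for the env var, pure stdlib (the problem strings and `hf_` prefix check are this sketch's conventions):

```python
import os

def check_token(var: str = "HF_TOKEN") -> list:
    """Return a list of problems with the token env var; empty if it looks sane."""
    token = os.environ.get(var)
    if token is None:
        return [f"{var} is not set"]
    problems = []
    if token != token.strip():
        problems.append("token has leading/trailing whitespace or a newline")
    if not token.strip().startswith("hf_"):
        problems.append("token does not start with 'hf_' (wrong value pasted?)")
    return problems

os.environ["HF_TOKEN"] = "hf_xxxxx\n"  # simulate a pasted trailing newline
print(check_token())
```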

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: