What's the real latency tax of cloud LLM APIs vs running locally?
The answer
One paragraph. No hedging beyond what the data actually warrants.
The latency tax is real: 200-1000ms of TTFT round-trip overhead per cloud call, versus 30-100ms locally. For a single chat message it's barely noticeable; for agent workflows that make many calls per edit, it compounds into seconds of wasted wall-clock time.
Where the latency comes from:
Cloud round-trip:
- DNS resolution + TLS handshake: ~30-80ms on a cold connection (skipped when the HTTP connection is reused)
- Network round-trip: ~20-50ms from US East coast, 80-150ms transatlantic
- Provider queue: ~50-200ms during normal load, 500ms+ during peak
- Model prefill: ~100-400ms depending on prompt size
- First token streaming: ~80-200ms
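If you want to see this breakdown on your own connection, pycurl exposes per-phase timers for a single request. A minimal sketch, assuming an OpenAI-compatible chat endpoint; the URL, model name, and API key are placeholders, not a specific provider's values:

```python
# Rough per-phase timing of one cloud call using pycurl's transfer timers.
import json
from io import BytesIO

import pycurl

payload = json.dumps({
    "model": "example-model",                        # placeholder model name
    "messages": [{"role": "user", "content": "ping"}],
    "stream": True,
    "max_tokens": 8,
})

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://api.example.com/v1/chat/completions")  # placeholder URL
c.setopt(pycurl.HTTPHEADER, ["Content-Type: application/json",
                             "Authorization: Bearer YOUR_KEY"])       # placeholder key
c.setopt(pycurl.POSTFIELDS, payload)
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.perform()

# pycurl timers are cumulative seconds from the start of the transfer,
# so the deltas below give per-phase durations.
dns   = c.getinfo(pycurl.NAMELOOKUP_TIME)
tcp   = c.getinfo(pycurl.CONNECT_TIME)
tls   = c.getinfo(pycurl.APPCONNECT_TIME)
first = c.getinfo(pycurl.STARTTRANSFER_TIME)   # first byte back ≈ TTFT when streaming
c.close()

print(f"DNS            {dns * 1000:6.1f} ms")
print(f"TCP connect    {(tcp - dns) * 1000:6.1f} ms")
print(f"TLS handshake  {(tls - tcp) * 1000:6.1f} ms")
print(f"queue+prefill  {(first - tls) * 1000:6.1f} ms")
```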
Typical end-to-end TTFT (time to first token):
- Anthropic / OpenAI / Google APIs: 400-800ms steady state, 1-2s during peak hours
- Groq: 200-300ms (their low-latency pitch is real)
- Local Ollama: 50-100ms (no network, no queue)
- Local vLLM with prefix cache hit: 30-60ms
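To reproduce these TTFT numbers for your own stack, time the gap between sending a streaming request and receiving the first SSE event. A minimal sketch, assuming an OpenAI-compatible `/chat/completions` endpoint (cloud providers, local vLLM, and Ollama's compatibility layer all expose one); the base URL and model name in the example call are assumptions, swap in your own:

```python
import time

import requests


def measure_ttft(base_url, model, api_key=None, prompt="Say hi"):
    """Seconds from sending a streaming request to the first SSE event (~first token)."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 16,
    }
    start = time.perf_counter()
    with requests.post(f"{base_url}/chat/completions", headers=headers,
                       json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # First non-empty SSE line means the server has started decoding.
            if line and line != b"data: [DONE]":
                return time.perf_counter() - start
    return None


# Example: local Ollama's OpenAI-compatible endpoint (assumes `ollama serve` is
# running and the model has been pulled; model name is an example).
print("local TTFT:", measure_ttft("http://localhost:11434/v1", "llama3.1"))
```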
Why this matters per workload:
| Workload | Latency sensitivity | Local advantage |
|---|---|---|
| Chat UI (single message) | Low | Marginal — humans don't notice 500ms |
| Coding autocomplete (FIM) | High | Huge — every 200ms hurts type-flow |
| Coding agent (multi-step) | High | Compounds — 8 steps × 600ms cloud delay = 5s wasted per edit |
| Voice-to-voice | Critical | Difference between "real conversation" and "walkie-talkie" |
| Batch processing | None | None — latency averages out |
| RAG over docs | Medium | Modest — most time is in generation, not first-token |
The "1000ms tax" framing: for agent workloads (Aider, Cline, Continue) that make 8-12 model calls per edit, the cloud round-trip overhead alone accounts for 4-12 seconds of wall-clock time. Local AI eliminates this entirely. That's why operators who use coding agents heavily often move to local even when the per-token cost is higher.
The decision rule:
- Latency-critical work (autocomplete, voice, agents): local wins regardless of model quality.
- Quality-critical work (one-shot analysis, complex reasoning): cloud frontier wins; latency doesn't matter.
- Cost-sensitive work (batch, high-volume classification): depends on volume — see /compounder for the cross-over math.
The compounding effect is real, but the specific multiplier depends on the workflow. Community reports from operators who've migrated cloud coding-agent workflows to local frequently say it "feels noticeably snappier" or gives "more iterations per hour", but any specific productivity-per-hour figure should come from your own measurements: we don't have controlled before/after data we'd publish as a single percentage.
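One way to collect your own numbers, reusing `measure_ttft` from the sketch further up: sample each endpoint several times and compare medians, since provider queueing makes individual cloud TTFTs noisy. The endpoints, model names, and API key below are placeholders:

```python
import statistics


def median_ttft(base_url, model, runs=10, **kw):
    """Median TTFT over several runs, to smooth out queueing noise."""
    return statistics.median(measure_ttft(base_url, model, **kw) for _ in range(runs))


# Substitute your own local endpoint, cloud endpoint, models, and key.
local = median_ttft("http://localhost:11434/v1", "llama3.1")
cloud = median_ttft("https://api.openai.com/v1", "gpt-4o-mini", api_key="sk-...")
print(f"median TTFT  local: {local * 1000:.0f} ms   cloud: {cloud * 1000:.0f} ms")
```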
Explore the numbers for your specific stack
Where we got the numbers
TTFT measurements: anthropic.com/pricing + OpenAI status page metrics + community benchmarks on r/LocalLLaMA 2026. Groq numbers from artificialanalysis.ai TTFT leaderboard. Cline / Aider edit-rate observation from community productivity threads.
Also see
The 5 agents where local wins on latency over cloud (Aider, Cline, Continue, Tabby, Twinny).
Where latency tax is critical — sub-1.5s total response is achievable locally.
When local also beats cloud on cost (it usually does past a daily-volume threshold).
TCO compounder — pure cost math, plus the latency-as-productivity multiplier.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.