What's the real latency tax of cloud LLM APIs vs running locally?

Reviewed May 15, 2026 · 2 min read
latency · ttft · local-vs-cloud · agents · productivity

The answer

One paragraph. No hedging beyond what the data actually warrants.

The latency tax is real: 200-1000ms of TTFT round-trip overhead on cloud vs 30-100ms locally. It matters more than you'd expect, and most of all for agents, where the overhead is paid on every call.

Where the latency comes from:

Cloud round-trip:

  • DNS resolution + TLS handshake: ~30-80ms with cached DNS and TLS session resumption; zero on a reused connection
  • Network round-trip: ~20-50ms from US East coast, 80-150ms transatlantic
  • Provider queue: ~50-200ms during normal load, 500ms+ during peak
  • Model prefill: ~100-400ms depending on prompt size
  • First token streaming: ~80-200ms
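
You can measure the first two items in this breakdown yourself by timing the DNS, TCP, and TLS stages separately. A minimal Python sketch: the host is just an example target, and this times a cold connection (on a warm, kept-alive connection these costs mostly disappear):

```python
import socket
import ssl
import time

HOST = "api.openai.com"  # example target; substitute your provider's host

t0 = time.perf_counter()
# DNS resolution (IPv4 only, so the address stays a plain (ip, port) pair)
infos = socket.getaddrinfo(HOST, 443, socket.AF_INET, socket.SOCK_STREAM)
ip, port = infos[0][4]
t_dns = time.perf_counter()

# TCP handshake
sock = socket.create_connection((ip, port), timeout=5)
t_tcp = time.perf_counter()

# TLS handshake (performed by wrap_socket on connect)
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=HOST)
t_tls = time.perf_counter()
tls.close()

print(f"DNS {1000 * (t_dns - t0):.0f}ms | "
      f"TCP {1000 * (t_tcp - t_dns):.0f}ms | "
      f"TLS {1000 * (t_tls - t_tcp):.0f}ms")
```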

Typical end-to-end TTFT (time to first token):

  • Anthropic / OpenAI / Google APIs: 400-800ms steady state, 1-2s during peak hours
  • Groq: 200-300ms (their low-latency pitch is real)
  • Local Ollama: 50-100ms (no network, no queue)
  • Local vLLM with prefix cache hit: 30-60ms
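
To reproduce these TTFT numbers on your own network, time the gap between sending a streaming request and receiving the first chunk. A sketch against the OpenAI-compatible chat endpoint that both cloud providers and a local Ollama server expose; BASE_URL, MODEL, and the API-key handling are placeholders to adjust:

```python
import os
import time
import requests  # pip install requests

# Point the same script at cloud and local to compare on one code path, e.g.
#   cloud: BASE_URL = "https://api.openai.com/v1",   MODEL = "gpt-4o-mini"
#   local: BASE_URL = "http://localhost:11434/v1",   MODEL = "llama3.1"
BASE_URL = "http://localhost:11434/v1"
MODEL = "llama3.1"

t0 = time.perf_counter()
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ.get('API_KEY', 'ollama')}"},
    json={
        "model": MODEL,
        "stream": True,
        "messages": [{"role": "user", "content": "Say hi."}],
    },
    stream=True,
    timeout=30,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if line:  # first non-empty SSE line ~ first token on the wire
        print(f"TTFT: {1000 * (time.perf_counter() - t0):.0f}ms")
        break
resp.close()
```

Run it a few times against each endpoint: single measurements are noisy, and queue delay in particular varies with provider load.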

Why this matters per workload:

| Workload | Latency sensitivity | Local advantage |
|---|---|---|
| Chat UI (single message) | Low | Marginal — humans tolerate ~500ms in chat |
| Coding autocomplete (FIM) | High | Huge — every 200ms hurts type-flow |
| Coding agent (multi-step) | High | Compounds — 8 steps × 600ms cloud delay ≈ 5s wasted per edit |
| Voice-to-voice | Critical | The difference between "real conversation" and "walkie-talkie" |
| Batch processing | None | None — latency averages out |
| RAG over docs | Medium | Modest — most time is in generation, not first-token |

The "1000ms tax" framing: for agent workloads (Aider, Cline, Continue) that make 8-12 model calls per edit, the cloud round-trip overhead alone accounts for 4-12 seconds of wall-clock time. Local AI eliminates this entirely. That's why operators who use coding agents heavily often move to local even when the per-token cost is higher.

The decision rule:

  • Latency-critical work (autocomplete, voice, agents): local wins regardless of model quality.
  • Quality-critical work (one-shot analysis, complex reasoning): cloud frontier wins; latency doesn't matter.
  • Cost-sensitive work (batch, high-volume classification): depends on volume — see /compounder for the cross-over math (a generic sketch follows below).
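
As a rough illustration of that cross-over: how many months until local hardware pays for itself at your token volume. This is a generic break-even sketch, not the actual /compounder model; every number below is a made-up placeholder, and a real comparison needs your own hardware, power, and token-price figures.

```python
def breakeven_months(hardware_usd: float,
                     tokens_per_month: float,
                     cloud_usd_per_mtok: float,
                     power_usd_per_month: float) -> float:
    """Months until hardware cost is recovered by avoided cloud spend."""
    monthly_savings = tokens_per_month / 1e6 * cloud_usd_per_mtok - power_usd_per_month
    return float("inf") if monthly_savings <= 0 else hardware_usd / monthly_savings

# Placeholder inputs: $3000 GPU box, 200M tokens/month, $3/Mtok cloud, $40/mo power
print(f"{breakeven_months(3000, 200e6, 3.0, 40):.1f} months")  # ~5.4 months
```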

The compounding effect is real, but the specific multiplier depends on the workflow. Community reports from operators who've migrated cloud coding-agent workflows to local frequently mention "feels noticeably snappier" or "more iterations per hour," but any specific productivity-per-hour number should come from your own measurements. We don't have controlled before/after data we'd publish as a single percentage.

Where we got the numbers

TTFT measurements: anthropic.com/pricing, OpenAI status-page metrics, and community benchmarks from r/LocalLLaMA (2026). Groq numbers: artificialanalysis.ai TTFT leaderboard. Cline / Aider edit-rate observations: community productivity threads.

Other questions in this thread

Other /q/ landings on the same topic — same editorial discipline.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.