What's the real latency tax of cloud LLM APIs vs running locally?

Reviewed May 15, 2026 · 2 min read
latency · ttft · local-vs-cloud · agents · productivity

The answer

One paragraph. No hedging beyond what the data actually warrants.

The latency tax is real: 200-1000ms of TTFT round-trip overhead on cloud vs 30-100ms locally. It matters more than you'd expect, and most of all for agents, where the overhead is paid on every call.

Where the latency comes from:

Cloud round-trip:

  • DNS resolution + TLS handshake: ~30-80ms with cached DNS and TLS session resumption; zero on a reused connection
  • Network round-trip: ~20-50ms from US East coast, 80-150ms transatlantic
  • Provider queue: ~50-200ms during normal load, 500ms+ during peak
  • Model prefill: ~100-400ms depending on prompt size
  • First token streaming: ~80-200ms
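
You can measure the first two items in this breakdown yourself by timing the DNS, TCP, and TLS stages separately. A minimal Python sketch: the host is just an example target, and this times a cold connection (on a warm, kept-alive connection these costs mostly disappear):

```python
import socket
import ssl
import time

HOST = "api.openai.com"  # example target; substitute your provider's host

t0 = time.perf_counter()
# DNS resolution (IPv4 only, so the address stays a plain (ip, port) pair)
infos = socket.getaddrinfo(HOST, 443, socket.AF_INET, socket.SOCK_STREAM)
ip, port = infos[0][4]
t_dns = time.perf_counter()

# TCP handshake
sock = socket.create_connection((ip, port), timeout=5)
t_tcp = time.perf_counter()

# TLS handshake (performed by wrap_socket on connect)
ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=HOST)
t_tls = time.perf_counter()
tls.close()

print(f"DNS {1000 * (t_dns - t0):.0f}ms | "
      f"TCP {1000 * (t_tcp - t_dns):.0f}ms | "
      f"TLS {1000 * (t_tls - t_tcp):.0f}ms")
```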

Typical end-to-end TTFT (time to first token):

  • Anthropic / OpenAI / Google APIs: 400-800ms steady state, 1-2s during peak hours
  • Groq: 200-300ms (their low-latency pitch is real)
  • Local Ollama: 50-100ms (no network, no queue)
  • Local vLLM with prefix cache hit: 30-60ms
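
To reproduce these TTFT numbers on your own network, time the gap between sending a streaming request and receiving the first chunk. A sketch against the OpenAI-compatible chat endpoint that both cloud providers and a local Ollama server expose; BASE_URL, MODEL, and the API-key handling are placeholders to adjust:

```python
import os
import time
import requests  # pip install requests

# Point the same script at cloud and local to compare on one code path, e.g.
#   cloud: BASE_URL = "https://api.openai.com/v1",   MODEL = "gpt-4o-mini"
#   local: BASE_URL = "http://localhost:11434/v1",   MODEL = "llama3.1"
BASE_URL = "http://localhost:11434/v1"
MODEL = "llama3.1"

t0 = time.perf_counter()
resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ.get('API_KEY', 'ollama')}"},
    json={
        "model": MODEL,
        "stream": True,
        "messages": [{"role": "user", "content": "Say hi."}],
    },
    stream=True,
    timeout=30,
)
resp.raise_for_status()
for line in resp.iter_lines():
    if line:  # first non-empty SSE line ~ first token on the wire
        print(f"TTFT: {1000 * (time.perf_counter() - t0):.0f}ms")
        break
resp.close()
```

Run it a few times against each endpoint: single measurements are noisy, and queue delay in particular varies with provider load.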

Why this matters per workload:

| Workload | Latency sensitivity | Local advantage |
|---|---|---|
| Chat UI (single message) | Low | Marginal — humans tolerate ~500ms in chat |
| Coding autocomplete (FIM) | High | Huge — every 200ms hurts type-flow |
| Coding agent (multi-step) | High | Compounds — 8 steps × 600ms cloud delay ≈ 5s wasted per edit |
| Voice-to-voice | Critical | The difference between "real conversation" and "walkie-talkie" |
| Batch processing | None | None — latency averages out |
| RAG over docs | Medium | Modest — most time is in generation, not first-token |

The "1000ms tax" framing: for agent workloads (Aider, Cline, Continue) that make 8-12 model calls per edit, the cloud round-trip overhead alone accounts for 4-12 seconds of wall-clock time. Local AI eliminates this entirely. That's why operators who use coding agents heavily often move to local even when the per-token cost is higher.

The decision rule:

  • Latency-critical work (autocomplete, voice, agents): local wins regardless of model quality.
  • Quality-critical work (one-shot analysis, complex reasoning): cloud frontier wins; latency doesn't matter.
  • Cost-sensitive work (batch, high-volume classification): depends on volume — see /compounder for the cross-over math (a generic sketch follows below).
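
As a rough illustration of that cross-over: how many months until local hardware pays for itself at your token volume. This is a generic break-even sketch, not the actual /compounder model; every number below is a made-up placeholder, and a real comparison needs your own hardware, power, and token-price figures.

```python
def breakeven_months(hardware_usd: float,
                     tokens_per_month: float,
                     cloud_usd_per_mtok: float,
                     power_usd_per_month: float) -> float:
    """Months until hardware cost is recovered by avoided cloud spend."""
    monthly_savings = tokens_per_month / 1e6 * cloud_usd_per_mtok - power_usd_per_month
    return float("inf") if monthly_savings <= 0 else hardware_usd / monthly_savings

# Placeholder inputs: $3000 GPU box, 200M tokens/month, $3/Mtok cloud, $40/mo power
print(f"{breakeven_months(3000, 200e6, 3.0, 40):.1f} months")  # ~5.4 months
```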

The compounding effect is real, but the specific multiplier depends on the workflow. Community reports from operators who've migrated cloud coding-agent workflows to local frequently mention "feels noticeably snappier" or "more iterations per hour," but any specific productivity-per-hour number should come from your own measurements. We don't have controlled before/after data we'd publish as a single percentage.

Where we got the numbers

TTFT measurements: anthropic.com/pricing, OpenAI status-page metrics, and community benchmarks from r/LocalLLaMA (2026). Groq numbers: artificialanalysis.ai TTFT leaderboard. Cline / Aider edit-rate observations: community productivity threads.

Other questions in this thread

Other /q/ landings on the same topic — same editorial discipline.

Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.