What is prompt caching?

Prompt caching lets an LLM reuse the computed state of a repeated prompt prefix across calls, so a long, stable system prompt or document is processed once and replayed cheaply on subsequent requests.

Prompt caching is a provider feature that stores the model's internal representation of a prompt prefix so it does not have to be recomputed on every call. Many workloads send the same large prefix repeatedly (a long system prompt, a tool catalog, a big reference document), followed by a short variable suffix. With caching, the first request processes the full prefix and the provider caches it; later requests that share the same prefix skip that work, so they return faster and are billed at a steep discount on the cached input tokens (often a fraction of the normal input price). A small write surcharge may apply on the first call. To benefit, you must keep the cached portion byte-identical and at the start of the prompt: the cache matches on an exact prefix, so any change near the top invalidates everything after it. The practical pattern is to order prompts from most stable to most volatile: system prompt, few-shot examples, and tool definitions first (cacheable), then the live conversation or query last. Caches are typically short-lived (minutes), so they help bursty, repetitive traffic most. For agents and RAG systems, caching the fixed scaffolding pairs well with context compaction of the changing tail, cutting both latency and cost on every turn.
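
The stable-first ordering and exact-prefix matching described above can be illustrated with a small, provider-independent sketch. Everything in it is an assumption made for illustration: the names build_prompt, prefix_key, ToyPrefixCache, and input_cost are hypothetical, the price and multipliers are placeholders rather than any provider's real rates, and the toy cache ignores expiry. Real providers match on tokenized prefixes inside their own infrastructure; hashing the prefix text here only mimics the "byte-identical or miss" behavior.

```python
import hashlib

# Placeholder pricing, NOT real provider rates (assumptions for illustration).
BASE_INPUT_PRICE = 3.00        # currency units per million input tokens
CACHE_READ_MULTIPLIER = 0.10   # assumed discount applied to cached prefix tokens
CACHE_WRITE_MULTIPLIER = 1.25  # assumed surcharge when the prefix is first cached


def build_prompt(system_prompt, few_shot_examples, tool_definitions, query):
    """Order content from most stable to most volatile.

    Returns (stable_prefix, volatile_suffix); only the prefix is cacheable.
    """
    stable_prefix = "\n\n".join([system_prompt, few_shot_examples, tool_definitions])
    return stable_prefix, query


def prefix_key(prefix):
    """Hash the exact bytes of the prefix; any change yields a different key."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


class ToyPrefixCache:
    """In-memory stand-in for a provider-side cache (no expiry modeled)."""

    def __init__(self):
        self._seen = set()

    def lookup_or_store(self, prefix):
        """Return True on a cache hit; on a miss, store the prefix and return False."""
        key = prefix_key(prefix)
        hit = key in self._seen
        self._seen.add(key)
        return hit


def input_cost(prefix_tokens, suffix_tokens, hit):
    """Estimate input cost for one request under the placeholder pricing above."""
    multiplier = CACHE_READ_MULTIPLIER if hit else CACHE_WRITE_MULTIPLIER
    billed_tokens = prefix_tokens * multiplier + suffix_tokens
    return billed_tokens * BASE_INPUT_PRICE / 1_000_000


if __name__ == "__main__":
    cache = ToyPrefixCache()
    for query in ["What is the refund policy?", "How do I reset my password?"]:
        prefix, suffix = build_prompt(
            "You are a support assistant.",
            "Q: example question\nA: example answer",
            "tool: search_orders(order_id)",
            query,
        )
        hit = cache.lookup_or_store(prefix)
        # Token counts are made up; a real system would use the provider's tokenizer.
        print(query, "hit" if hit else "miss", input_cost(10_000, 100, hit))
```

Because the query is kept out of the prefix, the second request reuses the cached entry even though its suffix differs. Putting anything volatile (a timestamp, a user ID) into the system prompt would change the key on every call and turn every request into a cache write.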