Friday, December 17, 1915

Prompt Caching: Cut LLM Costs by 90% Without Changing Your Prompts

If you're sending the same system prompt, tool definitions, or few-shot examples with every API call, you're paying to reprocess static content that hasn't changed since the last request. Prompt caching fixes that — the provider recognizes the repeated prefix and skips reprocessing it entirely.

Cached reads cost 50-90% less and arrive up to 80% faster. The mechanism is the same across providers: the model's key-value attention cache for a prefix gets reused when the same prefix appears again. What differs is how you opt in.

How It Actually Works

When an LLM processes your prompt, it computes attention key-value tensors for every token. Those tensors represent the model's understanding of the prefix — its internal state after reading the system prompt, context, and conversation history. Normally, those tensors are discarded after the response. With caching, they're retained.

On the next request, the provider hashes the prompt prefix. If the hash matches a cached entry (and the cache hasn't expired), the model skips the full prefill pass for that prefix. It picks up from the cached KV tensors and only processes the uncached suffix.
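The lookup can be pictured with a toy in-memory model. This is only an illustration of prefix hashing with a TTL, not how any provider implements it; `ToyPrefixCache`, the token lists, and the 300-second TTL are all hypothetical.

```python
import hashlib


class ToyPrefixCache:
    """Toy model: remembers hashes of prompt prefixes with an expiry time."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl_seconds = ttl_seconds
        self._expiry: dict[str, float] = {}  # prefix hash -> expiry timestamp

    @staticmethod
    def _hash(tokens: list[str]) -> str:
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def process(self, tokens: list[str], boundary: int, now: float) -> int:
        """Return how many tokens still need a prefill pass.

        `boundary` is the number of leading tokens treated as the cacheable prefix.
        """
        key = self._hash(tokens[:boundary])
        hit = self._expiry.get(key, 0.0) > now
        # A hit refreshes the entry; a miss writes it for later requests.
        self._expiry[key] = now + self.ttl_seconds
        return len(tokens) - boundary if hit else len(tokens)


cache = ToyPrefixCache(ttl_seconds=300)
prompt_a = ["sys1", "sys2", "sys3", "question-a"]
prompt_b = ["sys1", "sys2", "sys3", "question-b"]

first = cache.process(prompt_a, boundary=3, now=0)        # miss: full prefill
second = cache.process(prompt_b, boundary=3, now=60)      # hit: only the suffix
expired = cache.process(prompt_a, boundary=3, now=1000)   # miss: TTL elapsed
```

The second request differs only in its last token, so it reuses the three-token prefix; the third arrives after the entry has expired and pays for the whole prompt again.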

Three mechanics matter for your strategy:

  1. Prefix matching, not content matching. The cache key is the exact prefix from token 0 to the cache boundary. "System prompt → user message" and "user message → system prompt" are different prefixes even if they contain the same text.

  2. Lookback windows. Anthropic's explicit breakpoints walk backward 20 blocks looking for a prior cache entry. OpenAI uses a hash of the first ~256 tokens for routing. In both cases, a match requires an earlier request that wrote the cache — you can't cache content that no request has ever sent.

  3. Cache invalidation cascades downward. If you change tool definitions, the entire cache invalidates. If you change the system prompt, both system + messages caches invalidate. If you change only the user message at the end, the system + tools caches survive. Static-first = resilient caches.
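A minimal sketch of the cascade, assuming the prefix is built in the order tools, then system, then messages, as the third point implies. `surviving_layers` and the layer names are hypothetical; the sketch only models which layers stay reusable after a change.

```python
LAYERS = ["tools", "system", "messages"]  # assumed prefix order, earliest first


def surviving_layers(changed: str) -> list[str]:
    """Layers whose cached prefix is still valid after `changed` is modified.

    A change invalidates the changed layer and everything after it,
    because every later layer's prefix includes the changed bytes.
    """
    index = LAYERS.index(changed)
    return LAYERS[:index]


after_tools_change = surviving_layers("tools")        # nothing survives
after_system_change = surviving_layers("system")      # only tools survive
after_message_change = surviving_layers("messages")   # tools and system survive
```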

Provider Landscape

All three major providers support caching, but their APIs treat it at different levels of abstraction:

| | OpenAI | Anthropic | Google Gemini |
|---|---|---|---|
| Mechanism | Automatic prefix matching | Explicit cache_control breakpoints, or auto | Named cache resource, created explicitly |
| Opt-in | None (auto for ≥1024 tokens) | Add cache_control: { type: "ephemeral" } | cachedContent.create() API call |
| Default TTL | 5-10 min in-memory | 5 min | User-defined on creation |
| Extended TTL | Up to 24h (models ≥ gpt-4.1) | 1h (2x write cost) | Configurable up to hours |
| Cache read cost | 50% discount (automatic) | 10% of base input price | Storage cost per hour + reduced input rate |
| Cache write cost | Free | 1.25x base input price | Storage cost only |
| Min token threshold | 1024 | 1024-4096 (model-dependent) | Varies |
| Key differentiator | prompt_cache_key for request grouping | Pre-warming via max_tokens: 0 | Full lifecycle control, reference by name |

The API shape differs, but the application-layer strategy is identical: position static content first, keep it byte-for-byte identical across requests, and group related requests so they hit the same cache.

The Cost Difference Is Real

Consider a typical production agent: a 5,000-token system prompt (instructions, tool definitions, few-shot examples) followed by a 300-token user query. You process 1,000 requests/hour.

Without caching (all three providers, approximate):

  • 5,300 input tokens × 1,000 requests = 5.3M input tokens/hour
  • Anthropic Sonnet: ~$15.90/hour
  • OpenAI gpt-4o: ~$13.25/hour

With caching, assuming 90% cache hit rate:

  • Cache reads: 5,000 tokens cached × 900 requests = 4.5M cached read tokens

  • Not read from cache: 5,000 prefix tokens × 100 missed requests = 0.5M cache-write tokens, plus 300 query tokens × 1,000 requests = 0.3M regular input tokens (0.8M total)

  • Anthropic Sonnet (5-min cache): 4.5M × $0.30/Mtok (reads) + 0.5M × $3.75/Mtok (writes at 1.25x the $3 base) + 0.3M × $3/Mtok (query tokens) = $1.35 + $1.875 + $0.90 ≈ $4.13/hour (74% savings)

  • OpenAI gpt-4o (automatic): 4.5M × $1.25/Mtok (cached, 50% discount) + 0.8M × $2.50/Mtok (uncached) = $5.625 + $2.00 ≈ $7.63/hour (42% savings)
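A small cost model makes the billing assumptions explicit. This sketch assumes Anthropic-style billing (cached reads at 0.1x and cache writes at 1.25x the base input price, with the per-request query billed at the base price) and a $3/Mtok base rate; `hourly_input_cost` and its parameter names are hypothetical.

```python
def hourly_input_cost(
    prefix_tokens: int,
    query_tokens: int,
    requests: int,
    hit_rate: float,
    base_per_mtok: float,
    read_multiplier: float = 0.1,    # assumed: cached reads cost 10% of base
    write_multiplier: float = 1.25,  # assumed: cache writes cost 125% of base
) -> float:
    """Input-token cost in dollars for one hour of traffic."""
    hits = round(requests * hit_rate)
    misses = requests - hits
    read_tokens = prefix_tokens * hits
    write_tokens = prefix_tokens * misses
    fresh_tokens = query_tokens * requests  # the query itself is never cached
    weighted = (
        read_tokens * read_multiplier
        + write_tokens * write_multiplier
        + fresh_tokens
    )
    return weighted * base_per_mtok / 1_000_000


# No caching: every prefix token is billed at the base rate (multiplier 1.0).
uncached = hourly_input_cost(5_000, 300, 1_000, hit_rate=0.0,
                             base_per_mtok=3.0, write_multiplier=1.0)
cached = hourly_input_cost(5_000, 300, 1_000, hit_rate=0.9, base_per_mtok=3.0)
savings = 1 - cached / uncached
print(f"uncached ${uncached:.2f}/h, cached ${cached:.2f}/h, savings {savings:.0%}")
```

Changing `hit_rate` shows how quickly the savings erode: the write surcharge is paid on every miss, so a low hit rate can cost more than not caching at all.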

For a 5,000-token system prefix, the break-even on Anthropic's 1.25x write tax is roughly 2 requests within the TTL. After that, every hit is pure savings at 10% of base price.
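The break-even point can be checked directly. The sketch below measures cost in units of "one uncached pass over the prefix", so the prefix size cancels out; the 1.25x and 0.1x multipliers are the Anthropic figures quoted above, and the function names are hypothetical.

```python
def cached_cost(n_requests: int, write_mult: float = 1.25, read_mult: float = 0.1) -> float:
    """Relative cost of sending the same prefix n times within one TTL window.

    The first request writes the cache; the remaining ones read it.
    """
    if n_requests == 0:
        return 0.0
    return write_mult + (n_requests - 1) * read_mult


def break_even_requests(write_mult: float = 1.25, read_mult: float = 0.1) -> int:
    """Smallest request count at which caching is cheaper than not caching."""
    if read_mult >= 1.0:
        raise ValueError("cached reads must be cheaper than uncached reads")
    n = 1
    while cached_cost(n, write_mult, read_mult) >= n:
        n += 1
    return n


five_minute_cache = break_even_requests()             # 1.25x write cost
one_hour_cache = break_even_requests(write_mult=2.0)  # 2x write cost
```

With the 5-minute cache the second request already comes out ahead (1.35 versus 2.0 uncached passes); with the 1-hour cache at 2x write cost it takes a third request.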

The Harness Strategy

Provider APIs differ, but the application layer doesn't care. Here's the pattern that maximizes cache hits regardless of which API you're calling:

1. Static-First Prompt Layout

The single most impactful decision. Order every prompt the same way:

[1] System instructions        ← never changes
[2] Tool definitions           ← changes rarely
[3] Few-shot examples          ← never changes
[4] Context / documents        ← changes per session
[5] Conversation history       ← grows linearly
[6] Current user message       ← changes every request

Everything from [1] through [5] can be cached. [6] is the only uncached suffix. Anthropic users: place your cache_control breakpoint at the end of [5]. OpenAI users: this ordering is what the automatic prefix matcher expects.
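One way to keep this ordering from drifting is to check it in code. The sketch below is a hypothetical helper, not part of any SDK: each prompt segment is tagged with a volatility class, and the check reports the first segment that sits behind more volatile content.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical volatility classes: lower rank means more stable content.
RANK = {"static": 0, "per_session": 1, "per_turn": 2, "per_request": 3}


@dataclass(frozen=True)
class Segment:
    name: str
    volatility: str  # one of the RANK keys


def first_misplaced(segments: list[Segment]) -> Optional[str]:
    """Name of the first segment that follows more volatile content, else None."""
    highest_seen = -1
    for segment in segments:
        rank = RANK[segment.volatility]
        if rank < highest_seen:
            return segment.name
        highest_seen = max(highest_seen, rank)
    return None


def order_static_first(segments: list[Segment]) -> list[Segment]:
    """Stable sort: most stable content first, original order kept within a class."""
    return sorted(segments, key=lambda s: RANK[s.volatility])


draft = [
    Segment("system", "static"),
    Segment("documents", "per_session"),
    Segment("tools", "static"),  # static content stranded behind session content
    Segment("user_message", "per_request"),
]
problem = first_misplaced(draft)
fixed = [s.name for s in order_static_first(draft)]
```

Running a check like this in a unit test catches the case where someone appends a new static block to the end of the prompt builder.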

2. Stable Serialization

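Cache keys are exact match. If your tool definitions serialize with unstable key ordering, every request has a different prefix and you never hit. This can happen in Go when you build JSON by ranging over a map (iteration order is randomized, although encoding/json itself sorts map keys) and in Swift when JSONEncoder encodes a Dictionary without the .sortedKeys option. Pin your JSON serialization to sorted keys.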

# If you're building tool definitions dynamically, use sort_keys
import json
tools_json = json.dumps(tools, sort_keys=True)

Same goes for messages. If you're programmatically constructing the messages array, ensure identical ordering and formatting across requests that should share a cache.
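To see why this matters, compare two dictionaries that hold the same data but were built in a different order. `canonical_json` is a hypothetical helper; the fixed `separators` argument is an extra precaution so whitespace cannot differ between call sites.

```python
import json


def canonical_json(value) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


tool_a = {"name": "get_weather", "parameters": {"type": "object", "required": ["city"]}}
tool_b = {"parameters": {"required": ["city"], "type": "object"}, "name": "get_weather"}

# Python dicts keep insertion order, so the default output differs.
naive_equal = json.dumps(tool_a) == json.dumps(tool_b)
# Sorted keys remove the dependence on construction order.
canonical_equal = canonical_json(tool_a) == canonical_json(tool_b)
```

Note that `sort_keys` only reorders object keys. List order (for example the order of tools in the array) is still significant and has to be kept stable separately.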

3. Request Grouping with Cache Keys

OpenAI's prompt_cache_key lets you group unrelated requests that share a prefix. Without it, two users hitting different endpoints but sharing the same system prompt might route to different machines. With it, they're routed together:

# All requests sharing this system prompt get routed to the same machine
response = client.responses.create(
    model="gpt-5.5",
    prompt_cache_key="support-agent-v2",
    input=system_prompt + user_message
)

Choose a key that's scoped to shared content, not individual users. "support-agent-v2" is right. "user-1234" defeats the purpose.

4. Pre-Warm for Latency-Critical Paths

The first request after cache expiry is a cache miss — full latency, full price. If your app has a cold-start problem (think: chatbot first message after deploy), pre-warm the cache before users arrive.

Anthropic makes this explicit with max_tokens: 0:

# Run this after deploy, before routing user traffic
client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=0,
    system=[{
        "type": "text",
        "text": system_prompt,
        "cache_control": {"type": "ephemeral"}
    }],
    messages=[{"role": "user", "content": "warmup"}]
)

OpenAI doesn't have a pre-warm API, but sending one request per unique prefix right after deploy has the same effect — it seeds the cache before users generate real load.
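For long-running services the same idea extends to keeping the cache warm through quiet periods. The sketch below assumes that any request reading the prefix refreshes its TTL; `KeepWarm` is a hypothetical helper, and `warm` stands in for whichever provider call you use. Timestamps are passed in so the logic can be tested without sleeping.

```python
from typing import Callable, Optional


class KeepWarm:
    """Sends a warm-up request shortly before the cache TTL would lapse."""

    def __init__(self, warm: Callable[[], None], ttl_seconds: float = 300.0,
                 safety_margin: float = 60.0):
        self.warm = warm
        self.refresh_after = ttl_seconds - safety_margin
        self.last_touch: Optional[float] = None

    def note_traffic(self, now: float) -> None:
        """Record a real request; it is assumed to refresh the cache as well."""
        self.last_touch = now

    def tick(self, now: float) -> bool:
        """Call periodically; returns True if a warm-up request was sent."""
        if self.last_touch is not None and now - self.last_touch < self.refresh_after:
            return False
        self.warm()
        self.last_touch = now
        return True


calls = []
keeper = KeepWarm(lambda: calls.append("warm"), ttl_seconds=300, safety_margin=60)
results = [keeper.tick(0), keeper.tick(100)]   # warm on first tick, then still fresh
keeper.note_traffic(200)                       # a user request refreshed the cache
results += [keeper.tick(400), keeper.tick(450)]
```

Each warm-up is itself billed as a cache read (or a write after expiry), so this only makes sense where cold-start latency matters more than the small extra cost.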

5. Monitor Cache Hit Rates

Every provider exposes cache usage in the response. Log it.

# OpenAI (Chat Completions usage shape); prompt_tokens already includes cached tokens
cached = response.usage.prompt_tokens_details.cached_tokens
hit_rate = cached / response.usage.prompt_tokens

# Anthropic
cache_read = response.usage.cache_read_input_tokens
cache_write = response.usage.cache_creation_input_tokens
fresh = response.usage.input_tokens
hit_rate = cache_read / (cache_read + cache_write + fresh)

If your hit rate is below 50%, either your prompts aren't ordered static-first, you're sending fewer than the minimum token threshold, or your request volume is too low to keep the cache warm.
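Per-response numbers are noisy, so it helps to aggregate them. The sketch below computes a token-weighted hit rate over a batch of usage records given as plain dictionaries. It assumes OpenAI's `prompt_tokens` already includes cached tokens, while Anthropic's three counters are disjoint; the function names are hypothetical.

```python
def cached_and_total(usage: dict) -> tuple[int, int]:
    """Return (cached_input_tokens, total_input_tokens) for either usage shape."""
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        # Anthropic-style: read, written, and fresh tokens are counted separately.
        read = usage.get("cache_read_input_tokens", 0)
        written = usage.get("cache_creation_input_tokens", 0)
        return read, read + written + usage.get("input_tokens", 0)
    # OpenAI Chat Completions-style: prompt_tokens is the total, cached included.
    details = usage.get("prompt_tokens_details") or {}
    return details.get("cached_tokens", 0), usage.get("prompt_tokens", 0)


def token_hit_rate(usages: list[dict]) -> float:
    """Share of all input tokens that were served from cache."""
    cached = total = 0
    for usage in usages:
        c, t = cached_and_total(usage)
        cached += c
        total += t
    return cached / total if total else 0.0


sample = [
    {"prompt_tokens": 5300, "prompt_tokens_details": {"cached_tokens": 0}},
    {"prompt_tokens": 5300, "prompt_tokens_details": {"cached_tokens": 4992}},
    {"input_tokens": 300, "cache_read_input_tokens": 5000, "cache_creation_input_tokens": 0},
]
rate = token_hit_rate(sample)
```

A token-weighted rate tracks the bill more closely than counting requests with any cache hit, because a request that reuses only a small part of its prefix still counts as mostly uncached.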

When Caching Doesn't Help

Caching gives little or no benefit in these scenarios:

  • Short prompts (<1024 tokens). The minimum threshold varies by model (1024-4096) but the rule is the same: if your entire prompt is below the minimum, caching is skipped silently. No error, just no benefit.

  • Every request has a unique prefix. If each call has a different system prompt, different tool set, or different examples, there's nothing to reuse. This is common in "prompt builder" UIs where users assemble prompts from scratch each time.

  • Very low request volume. Cache TTLs are measured in minutes. If you send one request every 10 minutes, the cache expires between calls. You need sustained throughput to benefit.

  • Images or file uploads that change per request. In OpenAI, images are part of the prefix. If every request has a different image (even if the text is identical), no cache hit. Batch image analysis is a poor fit for caching.

  • Streaming-first applications. Cache hits reduce time-to-first-token latency, but if you're streaming tokens and the user is reading as they arrive, the perceived speedup is minimal. The cost savings still apply.
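The first three conditions are easy to test before investing in a caching strategy. The helper below is a rough, hypothetical pre-flight check; the 1024-token minimum and 300-second TTL defaults are taken from the figures in this post and should be replaced with the values for your model.

```python
def caching_blockers(
    prefix_tokens: int,
    seconds_between_requests: float,
    prefix_is_shared: bool,
    min_tokens: int = 1024,
    ttl_seconds: float = 300.0,
) -> list[str]:
    """Return the reasons caching is unlikely to help; empty means none were found."""
    reasons = []
    if prefix_tokens < min_tokens:
        reasons.append("prefix below minimum token threshold")
    if not prefix_is_shared:
        reasons.append("prefix is unique per request")
    if seconds_between_requests >= ttl_seconds:
        reasons.append("requests arrive less often than the TTL")
    return reasons


busy_agent = caching_blockers(5_000, seconds_between_requests=3.6, prefix_is_shared=True)
quiet_tool = caching_blockers(400, seconds_between_requests=600, prefix_is_shared=True)
```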

Pitfalls That Kill Your Cache

The timestamp trap. Putting a timestamp in your system prompt changes the prefix on every request, and even a coarse "Current date: 2026-06-10" line invalidates it each time the value rolls over. Move timestamps to the end (the user message), or omit them and let the model reason about time from your app's context.
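A simple lint over the static prefix can catch this before it ships. The patterns below are hypothetical and deliberately narrow (ISO dates, clock times, UUIDs); extend them for whatever per-request values your templates might interpolate.

```python
import re

# Hypothetical patterns for content that tends to change between requests.
VOLATILE_PATTERNS = {
    "iso_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "clock_time": re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_volatile(prefix: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in a prompt prefix."""
    findings = []
    for name, pattern in VOLATILE_PATTERNS.items():
        findings.extend((name, match.group(0)) for match in pattern.finditer(prefix))
    return findings


clean = find_volatile("You are a support agent for Acme.")
dated = find_volatile("You are a support agent for Acme.\nCurrent date: 2026-06-10")
```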

The image trap. Adding or removing an image anywhere in the prompt invalidates the entire cache on both OpenAI and Anthropic. If you're sending images, keep them stable across requests or batch them into sessions where they don't change.

The tool_choice trap. Changing tool_choice from "auto" to "required" invalidates the messages cache on Anthropic. OpenAI's behavior is similar. Pick a tool_choice and stick with it per deployment.

The thinking toggle. Enabling or disabling extended thinking (or changing the thinking budget) invalidates the messages cache on Anthropic; cached tool definitions and system prompts survive. If you toggle thinking per request, you're paying to rewrite the cached conversation on every call.

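The Go/Swift serialization trap. Go randomizes map iteration order, so a hand-rolled serializer that ranges over a map[string]interface{} can emit keys in a different order on each run (encoding/json's Marshal sorts map keys, so it is safe). Swift's JSONEncoder gives no ordering guarantee for dictionaries unless you set .sortedKeys. Either way, two identical tool definitions can produce different JSON bytes, different prefix hashes, and a cache miss. Sort your keys.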
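Several of these pitfalls come down to a request field changing when nobody meant it to. A small detector, sketched below, compares the cache-affecting fields of consecutive requests and names the ones that changed. `DriftDetector` and the `WATCHED_FIELDS` list are hypothetical; adjust the list to the parameters your provider documents as cache-affecting.

```python
import json

# Assumed cache-affecting request fields, based on the pitfalls listed above.
WATCHED_FIELDS = ("tools", "system", "tool_choice", "thinking")


class DriftDetector:
    """Reports which watched request fields changed since the previous call."""

    def __init__(self):
        self._previous: dict[str, str] = {}

    def changed_fields(self, request: dict) -> list[str]:
        # Canonical JSON so that key order alone is never reported as a change.
        current = {
            name: json.dumps(request.get(name), sort_keys=True)
            for name in WATCHED_FIELDS
        }
        changed = [
            name for name in WATCHED_FIELDS
            if self._previous and self._previous[name] != current[name]
        ]
        self._previous = current
        return changed


detector = DriftDetector()
base = {"system": "You are helpful.", "tools": [], "tool_choice": "auto",
        "thinking": {"type": "disabled"}}
first = detector.changed_fields(base)
same = detector.changed_fields(dict(base))
drifted = detector.changed_fields({**base, "tool_choice": "required"})
```

Logging the returned field names next to the cache hit rate makes it much easier to explain a sudden drop.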

Building a Cache-Aware Prompt Harness

If you're building a framework that abstracts over multiple providers, the caching behavior can be unified. Here's a minimal example:

import hashlib
import json  # needed by PromptTemplate.cache_key()
import time
from dataclasses import dataclass, field
from typing import Any

@dataclass
class PromptTemplate:
    system: str
    tools: list[dict] = field(default_factory=list)
    examples: list[dict] = field(default_factory=list)

    def build_messages(self, user_input: str, context: str = "") -> list[dict]:
        # Static content first → cacheable prefix
        messages = [{"role": "system", "content": self.system}]

        for ex in self.examples:
            messages.append({"role": "user", "content": ex["input"]})
            messages.append({"role": "assistant", "content": ex["output"]})

        if context:
            messages.append({"role": "system", "content": context})

        # Dynamic content last → uncached suffix
        messages.append({"role": "user", "content": user_input})
        return messages

    def cache_key(self) -> str:
        # Deterministic hash of the cacheable prefix
        prefix = json.dumps({
            "system": self.system,
            "tools": self.tools,
            "examples": self.examples
        }, sort_keys=True)
        return hashlib.sha256(prefix.encode()).hexdigest()[:16]


class CacheTracker:
    def __init__(self):
        self.hits = 0
        self.misses = 0
        self.last_warm = 0

    def record(self, response_usage: dict[str, Any]) -> None:
        cached = response_usage.get("cache_read_input_tokens", 0) \
              or response_usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)

        if cached > 0:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0

    def maybe_prewarm(self, template: PromptTemplate, client, ttl_seconds: int = 240) -> None:
        """Pre-warm if TTL has likely expired."""
        if time.time() - self.last_warm > ttl_seconds:
            # Provider-specific pre-warm call would go here
            self.last_warm = time.time()

The PromptTemplate class enforces static-first ordering — system, then examples, then context, then user input. The cache_key() method generates a deterministic hash of the cacheable prefix, usable as OpenAI's prompt_cache_key or for your own routing. CacheTracker wraps usage monitoring so you know if your strategy is working.

The Bottom Line

Prompt caching isn't a feature you build — it's a property of how you structure your prompts. Sort static content first, keep it identical across requests, group related traffic, and monitor hit rates. The provider handles the rest.

With cached input priced up to 10x lower than uncached input (Anthropic's 10% read rate; OpenAI's discount is 2x), this pays for itself immediately on any workload above a few requests per minute. The only mistake is leaving money on the table by ignoring it.