Gemini Context Caching: Reduce Costs by Up to 75%

Master Gemini's context caching API. Learn cache-optimized prompt structures, cache hit maximization, invalidation patterns, and cost comparisons for long-context workflows.

June 14, 2026
Gemini, Context Caching, Cost Optimization, API, Prompt Engineering

Gemini's context caching is one of the most impactful cost optimization features available for any LLM API. When you send the same large prefix across multiple requests — a system prompt plus reference documents, for example — caching can reduce your input token costs by up to 75%. The trade-off: you need to structure your prompts so the cacheable prefix stays identical across requests.

How Context Caching Works

Normally, every Gemini API request processes your entire prompt from scratch. With context caching, the reusable prefix (the part of your prompt that doesn't change between requests) is processed once and its processed representation is stored. Subsequent requests that reference the same prefix skip recomputation of those tokens, and the cached tokens are billed at a reduced rate.

The cacheable prefix must be the beginning of your content. You can't cache a middle section or a suffix. This constraint shapes how you structure prompts.

CACHEABLE (same across all requests):
├── System instruction
├── Reference documents
├── Domain knowledge
└── Context documents

NOT CACHEABLE (changes per request):
└── User query
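
To make the prefix constraint concrete, here is a minimal sketch that measures how much of two requests is shared from the start. It treats each request as a list of labelled segments, which is a simplification: the real comparison happens on tokens, and `shared_prefix_length` is a hypothetical helper, not part of any SDK.

```python
from itertools import takewhile


def shared_prefix_length(a: list[str], b: list[str]) -> int:
    """Count the leading segments that are identical in both requests."""
    return sum(1 for _ in takewhile(lambda pair: pair[0] == pair[1], zip(a, b)))


request_1 = ["system instruction", "reference docs", "domain knowledge", "query A"]
request_2 = ["system instruction", "reference docs", "domain knowledge", "query B"]
# A per-request value placed first means nothing after it counts as shared prefix.
request_3 = ["timestamp 12:01", "system instruction", "reference docs", "query C"]

print(shared_prefix_length(request_1, request_2))
print(shared_prefix_length(request_1, request_3))
```

The third request contains the same documents as the first, but because a changing value comes first, none of it lines up as a reusable prefix.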

Setting Up Context Caching

Context caching in Gemini works via a two-step process: first create a cached content resource, then reference it in subsequent requests.

Step 1: Create the Cached Content

from google import genai

client = genai.Client()  # reads the API key from the environment

# Create a cached content resource with the shared prefix
cached_content = client.caches.create(
    model="gemini-2.5-flash",
    config={
        # The system instruction has its own field rather than a "user" turn
        "system_instruction": system_instruction,
        "contents": [
            {"role": "user", "parts": [{"text": reference_documents}]},
        ],
        "ttl": "3600s",  # 1 hour
    },
)
cache_name = cached_content.name  # e.g., "cachedContents/xyz-123"

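The TTL (time-to-live) determines how long the cache persists. If you omit it, the API applies a default of 1 hour; check the current API documentation for any limits that apply to your model. Typical values range from 300s (5 minutes) to 86400s (24 hours). Storage is billed for as long as the cache lives, so shorter TTLs cost less to maintain, while longer TTLs amortize better for sustained workloads.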
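
TTL values are written as whole seconds followed by `s`. The sketch below builds such strings from `timedelta` values so the arithmetic stays readable. Both helper names are hypothetical, and the default safety factor of 2 is only a rule of thumb for batch jobs, not an API requirement.

```python
from datetime import timedelta


def ttl_string(duration: timedelta) -> str:
    """Format a duration as a whole-second TTL string such as '3600s'."""
    seconds = int(duration.total_seconds())
    if seconds <= 0:
        raise ValueError("TTL must be a positive duration")
    return f"{seconds}s"


def ttl_for_batch(expected_batch: timedelta, safety_factor: float = 2.0) -> str:
    """TTL sized to outlast a batch run; the safety factor is a tunable assumption."""
    return ttl_string(expected_batch * safety_factor)


print(ttl_string(timedelta(hours=1)))
print(ttl_for_batch(timedelta(minutes=20)))
```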

Step 2: Use the Cache in Requests

# Reference the cached prefix when generating
response = client.models.generate_content(
    model="gemini-2.5-flash",
    config={
        "cached_content": cache_name
    },
    contents=[{"role": "user", "parts": [{"text": user_query}]}]
)


Cache-Optimized Prompt Structure

# CACHEABLE PREFIX (same for all requests)
system_instruction = """
You are a legal document analyst specializing in commercial
lease agreements in California...
"""

reference_documents = """
[DOC: ca-commercial-code-2024]
... full text of relevant statutes ...

[DOC: standard-lease-template]
... full text of the standard lease form ...

[DOC: case-law-summaries]
... summaries of 50 relevant cases ...
"""

cached_prefix = system_instruction + reference_documents

# PER-REQUEST SUFFIX (changes each time)
user_query = "Review this lease for compliance issues: [LEASE TEXT]"

# Send request with cached prefix + new query

Note:

Always place your system instruction inside the cacheable prefix, and keep it byte-identical from one request to the next. This ensures consistent behavior across cached and non-cached requests. If your system prompt changes between requests, the prefix changes with it and those requests can't share a cache.
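
One way to check that the prefix really is identical across requests is to fingerprint it and compare fingerprints before reusing a cache. This is a sketch under that assumption; `prefix_fingerprint` is a hypothetical helper, and the hash is only a local consistency check, not something the Gemini API consumes.

```python
import hashlib


def prefix_fingerprint(system_instruction: str, reference_documents: str) -> str:
    """Return a stable SHA-256 fingerprint of the cacheable prefix."""
    digest = hashlib.sha256()
    for part in (system_instruction, reference_documents):
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


baseline = prefix_fingerprint("You are a legal document analyst.", "[DOC: lease]")
same = prefix_fingerprint("You are a legal document analyst.", "[DOC: lease]")
drifted = prefix_fingerprint("You are a legal document analyst. ", "[DOC: lease]")

print(baseline == same)
print(baseline == drifted)
```

Even a single trailing space changes the fingerprint, which mirrors how a small edit to the prefix prevents requests from sharing a cache.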

Maximizing Cache Hits

1. Fixed Structure, Variable Content

The most common caching pattern: keep the structure constant, but swap reference documents.

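# Cacheable: system instruction + document schema
cached_prefix = """
System instruction...
Output format specification...
Document analysis framework...
"""

# Per-request: only the document changes
# (client is a google.genai Client; load_document, document_ids, and
# analysis_query are placeholders for your own code)
documents = [load_document(doc_id) for doc_id in document_ids]
for doc in documents:
    # The fixed prefix comes first so every request starts with the same tokens
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=cached_prefix + doc + analysis_query,
    )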

2. User Session Caching

For multi-turn conversations with a persistent system prompt and reference set:

# Create cache for the session
session_cache = {
    "system_instruction": "...",
    "project_documents": [...],
    "conversation_history": [...],  # Growing but still a prefix
    "ttl": "3600s"
}

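# Each user turn appends to conversation history, so earlier turns keep
# the same leading position and remain part of the shared prefix.
# A cached content resource does not grow on its own: recreate it to
# fold in new turns, or send the turns after the cached prefix.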

3. Batch Processing with Shared Context

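# Cache the shared analysis framework
# (client is a google.genai Client; the real framework text must meet
# the model's minimum cacheable token count)
framework = """
Classification schema: ...
Evaluation criteria: ...
Output template: ...
"""
framework_cache = client.caches.create(
    model="gemini-2.5-flash",
    config={"system_instruction": framework, "ttl": "3600s"},
)

# Process 1000 documents, each as a separate request
for doc in large_document_set:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        config={"cached_content": framework_cache.name},
        contents=doc + "\n\nClassify this document.",
    )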
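
For batch jobs it helps to guarantee that the shared context is cached once, not once per document. The sketch below shows a get-or-create registry keyed by a hash of the prefix. `PrefixCacheRegistry` and `fake_create_cache` are hypothetical names; in real use the injected function would wrap the cache creation call and return the cache name. Expiry is deliberately not handled here.

```python
import hashlib
from typing import Callable


class PrefixCacheRegistry:
    """Maps each distinct prefix to the name of an already-created cache."""

    def __init__(self, create_cache: Callable[[str], str]) -> None:
        self._create_cache = create_cache  # assumed to return a cache name
        self._names: dict[str, str] = {}

    def name_for(self, prefix: str) -> str:
        key = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
        if key not in self._names:
            # First time this exact prefix is seen: create the cache once.
            self._names[key] = self._create_cache(prefix)
        return self._names[key]


created: list[str] = []


def fake_create_cache(prefix: str) -> str:
    """Stand-in for a real cache creation call; records each invocation."""
    created.append(prefix)
    return f"cachedContents/demo-{len(created)}"


registry = PrefixCacheRegistry(fake_create_cache)
framework = "Classification schema: ...\nEvaluation criteria: ...\nOutput template: ..."
names = [registry.name_for(framework) for _ in range(1000)]

print(len(created), names[0])
```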

Cache Invalidation

Caches expire based on TTL. But you may want to invalidate earlier:

  • Content updates: reference documents changed
  • User session ended: free cached resources
  • Cost management: cache costs exceed savings for infrequent requests

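Invalidate by deleting the cached content resource through the API (in the google-genai Python SDK, client.caches.delete(name=cache_name)), or by updating it with a short TTL and letting it expire. Cached content can't be edited in place, so when reference documents change, create a new cache and retire the old one.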
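
The first invalidation trigger, changed content, can be detected locally before any request is sent. A minimal sketch, assuming you keep a small record for each cache you create: `CacheRecord` and `needs_replacement` are hypothetical names, and the fingerprint is any stable hash of the prefix text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CacheRecord:
    name: str                 # cache name returned at creation time
    content_fingerprint: str  # hash of the prefix the cache was built from
    expires_at: datetime      # creation time plus TTL


def needs_replacement(record: CacheRecord, current_fingerprint: str, now: datetime) -> bool:
    """True when the cache has expired or no longer matches the source content."""
    expired = now >= record.expires_at
    content_changed = record.content_fingerprint != current_fingerprint
    return expired or content_changed


created_at = datetime(2026, 6, 14, 9, 0, tzinfo=timezone.utc)
record = CacheRecord("cachedContents/demo", "abc123", created_at + timedelta(hours=1))

print(needs_replacement(record, "abc123", created_at + timedelta(minutes=30)))
print(needs_replacement(record, "def456", created_at + timedelta(minutes=30)))
print(needs_replacement(record, "abc123", created_at + timedelta(hours=2)))
```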

When NOT to Use Caching

Caching isn't always worth it:

Scenario | Recommendation
One-off queries with unique context | Skip caching: no reuse value
Frequently changing system prompts | Can't cache: prefix always changes
Very short prompts (< 1K tokens) | Overhead may exceed savings
Highly dynamic reference docs | Cache invalidation churn erases benefit
Latency-sensitive real-time apps | Cache creation adds initial latency

Cost Comparison

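For a workflow processing 1,000 documents with a 50K-token shared prefix and about 2K tokens of per-document content, assuming cached tokens are billed at a 75% discount and leaving storage charges aside:

Approach | Billed input tokens | Cost (relative)
No caching | 1,000 × (50K + 2K) = 52M at full rate | 100%
With caching | 50K (cache creation) + 1,000 × 50K at 25% of full rate + 1,000 × 2K (variable) ≈ 14.55M full-rate equivalents | ~28%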

This is the ideal case — your effective savings depend on prefix-to-suffix ratio and request frequency.
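
The dependence on prefix-to-suffix ratio can be sketched with a small calculator. It assumes cached tokens are billed at a discount instead of being free (the 0.75 default is the up-to-75% figure from the introduction), that cache creation is billed once at the full input rate, and that storage is ignored. `relative_input_cost` is a hypothetical helper, not an official pricing formula.

```python
def relative_input_cost(
    prefix_tokens: int,
    suffix_tokens: int,
    requests: int,
    cached_discount: float = 0.75,
) -> float:
    """Input cost with caching divided by input cost without it (storage excluded)."""
    uncached = requests * (prefix_tokens + suffix_tokens)
    cached = (
        prefix_tokens                                         # one-time cache creation
        + requests * prefix_tokens * (1.0 - cached_discount)  # discounted cached reads
        + requests * suffix_tokens                            # per-request content
    )
    return cached / uncached


large_prefix = relative_input_cost(50_000, 2_000, 1_000)
small_prefix = relative_input_cost(2_000, 50_000, 1_000)
print(f"large prefix: {large_prefix:.1%}, small prefix: {small_prefix:.1%}")
```

When the prefix dominates the request, the ratio approaches one minus the discount; when the suffix dominates, caching barely moves the total.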

Note:

Caching has its own cost: you're charged for cache storage based on token count and TTL duration. For infrequent workloads (a few requests per hour with a large prefix), storage costs can exceed the savings from skipped recomputation. Calculate both sides before committing to a caching strategy.
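
Both sides of that calculation reduce to a break-even request rate. In the sketch below the prefix size cancels out, because storage cost and per-request savings both scale with it. The prices are placeholders chosen for illustration, not quoted from any price list, the one-time creation charge is ignored, and `break_even_requests_per_hour` is a hypothetical helper.

```python
def break_even_requests_per_hour(
    input_price_per_token: float,
    storage_price_per_token_hour: float,
    cached_discount: float = 0.75,
) -> float:
    """Requests per hour at which caching savings equal the storage charge."""
    saving_per_token_per_request = input_price_per_token * cached_discount
    return storage_price_per_token_hour / saving_per_token_per_request


# Placeholder prices (per token), for illustration only.
rate = break_even_requests_per_hour(
    input_price_per_token=0.30 / 1_000_000,
    storage_price_per_token_hour=1.00 / 1_000_000,
)
print(f"break-even: {rate:.2f} requests per hour")
```

Below the break-even rate, storage costs more than the discount saves; above it, caching pays for itself.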

Common Failures

Failure | Cause | Fix
Low cache hit rate | Prefix changes between requests (timestamps, IDs, counters) | Strip all dynamic content from the cacheable prefix
TTL too short | Cache expires before you use it | Set TTL to 2x your expected batch duration
TTL too long | Paying for cache storage on idle data | Monitor usage; set TTL to match actual access patterns
Wrong prefix boundary | Trying to cache from the middle | Always place cacheable content at the very start
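
The first failure in the table is the easiest to catch automatically. A minimal sketch of a pre-flight check that scans a prefix for two common kinds of dynamic content: the patterns are illustrative assumptions and far from exhaustive, and `find_dynamic_content` is a hypothetical helper.

```python
import re

# Illustrative patterns only; extend with whatever your templates inject.
DYNAMIC_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
        r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_dynamic_content(prefix: str) -> list[str]:
    """Return the names of dynamic-content patterns found in the prefix."""
    return [name for name, pattern in DYNAMIC_PATTERNS.items() if pattern.search(prefix)]


print(find_dynamic_content("You are a legal document analyst."))
print(find_dynamic_content("Generated at 2026-06-14T09:30:00. You are an analyst."))
print(find_dynamic_content("Request 123e4567-e89b-12d3-a456-426614174000: analyze."))
```

Running a check like this before creating a cache flags values that belong in the per-request suffix instead.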