Gemini's context caching is one of the most impactful cost optimization features available for any LLM API. When you send the same large prefix across multiple requests — a system prompt plus reference documents, for example — caching can reduce your input token costs by up to 75%. The trade-off: you need to structure your prompts so the cacheable prefix stays identical across requests.

How Context Caching Works

Normally, every Gemini API request processes your entire prompt from scratch. With caching, the reusable prefix (the part of your prompt that doesn't change between requests) is processed once and its processed representation is stored. Subsequent requests that reference the same prefix skip recomputation of those tokens.

The cacheable prefix must be the beginning of your content. You can't cache a middle section or a suffix. This constraint shapes how you structure prompts.

CACHEABLE (same across all requests):
├── System instruction
├── Reference documents
├── Domain knowledge
└── Context documents

NOT CACHEABLE (changes per request):
└── User query
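To make the prefix constraint concrete, here is a minimal sketch that finds the portion shared by a set of prompts. It compares raw strings with the standard library's `os.path.commonprefix`; the real cache operates on tokens, so treat the character-level comparison as an approximation. `split_cacheable` is a hypothetical helper name, not part of any SDK.

```python
import os


def split_cacheable(prompts: list[str]) -> tuple[str, list[str]]:
    """Return (shared_prefix, per_request_suffixes) for a list of prompts."""
    prefix = os.path.commonprefix(prompts)
    return prefix, [p[len(prefix):] for p in prompts]


prompts = [
    "SYSTEM: You are a lease analyst.\nDOCS: ...\nQUERY: Review lease A",
    "SYSTEM: You are a lease analyst.\nDOCS: ...\nQUERY: Review lease B",
]
prefix, suffixes = split_cacheable(prompts)
print(repr(prefix))   # everything up to the first differing character
print(suffixes)       # the per-request remainder of each prompt
```

If anything that varies per request (a timestamp, a request ID) appears early in the prompt, the shared prefix shrinks to whatever precedes it, which is why the variable parts belong at the end.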

Setting Up Context Caching

Context caching in Gemini works via a two-step process: first create a cached content resource, then reference it in subsequent requests.

Step 1: Create the Cached Content

from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Create a cached content resource with the shared prefix
cached_content = client.caches.create(
    model="gemini-2.5-flash",
    config=types.CreateCachedContentConfig(
        system_instruction=system_instruction,
        contents=[
            types.Content(role="user", parts=[types.Part(text=reference_documents)]),
        ],
        ttl="3600s",  # 1 hour
    ),
)
cache_name = cached_content.name  # e.g., "cachedContents/xyz-123"

The TTL (time-to-live) determines how long the cache persists. Minimum is implementation-specific; typical values range from 300s (5 minutes) to 86400s (24 hours). Shorter TTLs cost less to maintain; longer TTLs amortize better for sustained workloads.

Step 2: Use the Cache in Requests

# Reference the cached prefix when generating
response = client.models.generate_content(
    model="gemini-2.5-flash",
    config={
        "cached_content": cache_name
    },
    contents=[{"role": "user", "parts": [{"text": user_query}]}]
)
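One way to confirm that a request actually used the cache is to inspect the token counts on the response. The sketch below assumes the response's usage metadata exposes `prompt_token_count` and `cached_content_token_count` (as the google-genai SDK does at the time of writing; verify against your SDK version). `cached_fraction` is a hypothetical helper, and the object passed in here is a stand-in rather than a real response.

```python
from types import SimpleNamespace


def cached_fraction(usage_metadata) -> float:
    """Fraction of prompt tokens served from the cache (0.0 if unknown)."""
    cached = getattr(usage_metadata, "cached_content_token_count", None) or 0
    total = getattr(usage_metadata, "prompt_token_count", None) or 0
    return cached / total if total else 0.0


# Stand-in for response.usage_metadata; the numbers are illustrative only.
fake_usage = SimpleNamespace(
    prompt_token_count=52_000,
    cached_content_token_count=50_000,
)
print(f"{cached_fraction(fake_usage):.1%}")  # 96.2%
```

A fraction near zero on a request that was supposed to hit the cache usually means the cache reference was missing or the cache had already expired.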


Cache-Optimized Prompt Structure

# CACHEABLE PREFIX (same for all requests)
system_instruction = """
You are a legal document analyst specializing in commercial
lease agreements in California...
"""

reference_documents = """
[DOC: ca-commercial-code-2024]
... full text of relevant statutes ...

[DOC: standard-lease-template]
... full text of the standard lease form ...

[DOC: case-law-summaries]
... summaries of 50 relevant cases ...
"""

cached_prefix = system_instruction + reference_documents

# PER-REQUEST SUFFIX (changes each time)
user_query = "Review this lease for compliance issues: [LEASE TEXT]"

# Send request with cached prefix + new query

Note:

Always place your system instruction inside the cacheable prefix, even for individual requests. This ensures consistent behavior across cached and non-cached requests. If your system prompt changes between requests, you can't use caching.

Maximizing Cache Hits

1. Fixed Structure, Variable Content

The most common caching pattern: keep the structure constant, but swap reference documents.

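# Cacheable: system instruction + output format + analysis framework
cached_prefix = """
System instruction...
Output format specification...
Document analysis framework...
"""

# Per-request: only the document changes.
# `cache_name` refers to a cache created from cached_prefix (see Step 1);
# load_document, document_ids, and analysis_query are placeholders.
documents = [load_document(doc_id) for doc_id in document_ids]
for doc in documents:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        config={"cached_content": cache_name},
        contents=doc + "\n\n" + analysis_query,
    )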

2. User Session Caching

For multi-turn conversations with a persistent system prompt and reference set:

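# Create cache for the session (static parts only)
session_cache = {
    "system_instruction": "...",
    "project_documents": [...],
    "ttl": "3600s"
}

# Each user turn appends to the conversation history, which is sent
# alongside the cache reference. A cached content resource is fixed once
# created, so the history grows outside it; recreate the cache if you
# want older turns folded into the cached prefix.
conversation_history = [...]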
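The session pattern can be sketched as a small helper that pairs a cache name with the running conversation. This assumes the explicit cache holds only the static system instruction and documents, and that the conversation history is sent with each request; `Session` is a hypothetical class, and the message dictionaries follow the `role`/`parts` shape used in the request example earlier.

```python
from dataclasses import dataclass, field


def _message(role: str, text: str) -> dict:
    return {"role": role, "parts": [{"text": text}]}


@dataclass
class Session:
    cache_name: str  # explicit cache with system prompt + documents
    history: list[dict] = field(default_factory=list)

    def build_contents(self, user_message: str) -> list[dict]:
        """Contents for the next request: prior turns plus the new message."""
        return self.history + [_message("user", user_message)]

    def record(self, user_message: str, model_reply: str) -> None:
        """Store a completed turn so later requests include it."""
        self.history.append(_message("user", user_message))
        self.history.append(_message("model", model_reply))


session = Session(cache_name="cachedContents/example")
first = session.build_contents("Summarize the project brief.")
session.record("Summarize the project brief.", "Here is a summary...")
second = session.build_contents("Now list the open risks.")
print(len(first), len(second))  # 1 3
```

Each request would pass `session.cache_name` as the cached content reference and the result of `build_contents` as the request contents.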

3. Batch Processing with Shared Context

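# Cache the shared analysis framework
framework = """
Classification schema: ...
Evaluation criteria: ...
Output template: ...
"""

# Process 1000 documents, each as a separate request.
# `cache_name` refers to a cache created from `framework` (see Step 1).
for doc in large_document_set:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        config={"cached_content": cache_name},
        contents=doc + "\n\nClassify this document.",
    )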

Cache Invalidation

Caches expire based on TTL. But you may want to invalidate earlier:

Content updates: reference documents changed
User session ended: free cached resources
Cost management: cache costs exceed savings for infrequent requests

Invalidate by deleting the cache through the API's cache management endpoints, or by shortening its TTL and letting it expire.
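The two invalidation routes described above can be wrapped in one helper. This sketch assumes a google-genai style client exposing `client.caches.delete(name=...)` and `client.caches.update(name=..., config={"ttl": ...})`; check both against the SDK version you use. `invalidate_cache` and its `grace_seconds` parameter are hypothetical.

```python
def invalidate_cache(client, cache_name: str, grace_seconds: int = 0) -> str:
    """Delete a cache immediately, or shorten its TTL so it expires soon.

    A grace period is useful when in-flight requests may still reference
    the cache; with grace_seconds=0 the cache is deleted right away.
    """
    if grace_seconds > 0:
        client.caches.update(
            name=cache_name,
            config={"ttl": f"{grace_seconds}s"},
        )
        return "ttl_shortened"
    client.caches.delete(name=cache_name)
    return "deleted"


# Usage (requires a configured client and an existing cache):
# invalidate_cache(client, cache_name)                    # delete now
# invalidate_cache(client, cache_name, grace_seconds=60)  # expire in 1 minute
```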

When NOT to Use Caching

Caching isn't always worth it:

Scenario	Recommendation
One-off queries with unique context	Skip caching — no reuse value
Frequently changing system prompts	Can't cache — prefix always changes
Very short prompts (< 1K tokens)	Overhead may exceed savings
Highly dynamic reference docs	Cache invalidation churn erases benefit
Latency-sensitive real-time apps	Cache creation adds initial latency

Cost Comparison

For a workflow processing 1,000 documents with a 50K-token shared prefix:

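Approach	Input Tokens	Cost (relative)
No caching	1,000 × (50K + 2K) = 52M at the full rate	100%
With caching	50K (cache creation) + 1,000 × 2K (variable) = 2.05M at the full rate, plus 1,000 × 50K = 50M at the cached rate	~28% before storage, at a 75% cached-token discount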

This is the ideal case — your effective savings depend on prefix-to-suffix ratio and request frequency.
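The arithmetic behind a comparison like this can be sketched as a small function. It assumes cached tokens are billed on every request at a discounted rate given by `cached_discount` (an assumption you should check against current pricing), and it leaves out storage charges. `relative_input_cost` is a hypothetical helper.

```python
def relative_input_cost(
    n_requests: int,
    prefix_tokens: int,
    suffix_tokens: int,
    cached_discount: float,
) -> float:
    """Input cost with caching as a fraction of the no-caching cost.

    cached_discount is the fraction taken off cached tokens
    (0.75 means cached tokens cost 25% of the normal rate).
    Storage charges are not included.
    """
    baseline = n_requests * (prefix_tokens + suffix_tokens)
    # Billed at the full rate: one cache creation plus every variable suffix.
    full_rate = prefix_tokens + n_requests * suffix_tokens
    # Billed at the cached rate: the prefix on every request.
    cached_rate = n_requests * prefix_tokens * (1 - cached_discount)
    return (full_rate + cached_rate) / baseline


print(f"{relative_input_cost(1000, 50_000, 2_000, 0.75):.1%}")  # 28.0%
print(f"{relative_input_cost(1000, 50_000, 2_000, 1.00):.1%}")  # 3.9%
```

The second call treats cached tokens as free, which is the most optimistic bound; the first applies a 75% discount. The gap between the two shows how strongly the result depends on the cached-token rate.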

Note:

Caching has its own cost: you're charged for cache storage based on token count and TTL duration. For infrequent workloads (a few requests per hour with a large prefix), storage costs can exceed the savings from skipped recomputation. Calculate both sides before committing to a caching strategy.
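A rough break-even estimate makes this trade-off easier to judge. The sketch below compares the hourly storage charge with the per-request saving on cached tokens; it ignores the one-time cache creation cost, and the prices in the example are placeholders rather than published rates. `break_even_requests_per_hour` is a hypothetical helper.

```python
def break_even_requests_per_hour(
    input_price_per_mtok: float,
    storage_price_per_mtok_hour: float,
    cached_discount: float,
) -> float:
    """Requests per hour at which cached-token savings equal storage cost.

    Both sides scale with prefix size, so the prefix token count cancels:
      saving per request = prefix * input_price * cached_discount
      storage per hour   = prefix * storage_price
    """
    saving_per_request = input_price_per_mtok * cached_discount
    if saving_per_request <= 0:
        raise ValueError("caching never pays off without a per-request saving")
    return storage_price_per_mtok_hour / saving_per_request


# Placeholder prices per million tokens; substitute current published rates.
rate = break_even_requests_per_hour(
    input_price_per_mtok=0.30,
    storage_price_per_mtok_hour=1.00,
    cached_discount=0.75,
)
print(f"break-even at about {rate:.1f} requests/hour")
```

Below the break-even rate, storage costs more than the cache saves; above it, caching comes out ahead once the creation cost has been amortized.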

Common Failures

Failure	Cause	Fix
Low cache hit rate	Prefix changes between requests (timestamps, IDs, counters)	Strip all dynamic content from the cacheable prefix
TTL too short	Cache expires before you use it	Set TTL to 2x your expected batch duration
TTL too long	Paying for cache storage on idle data	Monitor usage; set TTL to match actual access patterns
Wrong prefix boundary	Trying to cache from the middle	Always place cacheable content at the very start
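Two of these fixes are easy to automate. The sketch below scans a prefix for content that tends to vary per request and derives a TTL from an expected batch duration using the 2x rule from the table. The regular expressions cover only ISO-style timestamps and UUIDs and are illustrative, not exhaustive; `find_dynamic_content` and `ttl_for_batch` are hypothetical helpers.

```python
import re

DYNAMIC_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(?::\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
        r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_dynamic_content(prefix: str) -> dict[str, list[str]]:
    """Return matches of known per-request patterns found in the prefix."""
    hits: dict[str, list[str]] = {}
    for label, pattern in DYNAMIC_PATTERNS.items():
        found = [m.group(0) for m in pattern.finditer(prefix)]
        if found:
            hits[label] = found
    return hits


def ttl_for_batch(expected_batch_seconds: float) -> str:
    """TTL string set to twice the expected batch duration."""
    return f"{int(expected_batch_seconds * 2)}s"


prefix = "Generated at 2024-05-01T12:30:00. You are a legal analyst."
print(find_dynamic_content(prefix))  # flags the timestamp
print(ttl_for_batch(1800))           # 3600s
```

Running the scan before creating a cache catches the most common cause of a low hit rate: a value that silently changes the prefix on every request.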

Related Pages

1M Token Strategies: Foundation for long-context prompt structure
Large Document Analysis: Workflows that benefit from caching
