Gemini Context Caching: Reduce Costs by 75%
Master Gemini's context caching API. Learn cache-optimized prompt structures, cache hit maximization, invalidation patterns, and cost comparisons for long-context workflows.
Gemini's context caching is one of the most impactful cost optimization features available for any LLM API. When you send the same large prefix across multiple requests — a system prompt plus reference documents, for example — caching can reduce your input token costs by up to 75%. The trade-off: you need to structure your prompts so the cacheable prefix stays identical across requests.
How Context Caching Works
Normally, every Gemini API request processes your entire prompt from scratch. With caching enabled, Gemini identifies a reusable prefix — the part of your prompt that doesn't change between requests — and stores its processed representation. Subsequent requests with the same prefix skip recomputation of those tokens.
The cacheable prefix must be the beginning of your content. You can't cache a middle section or a suffix. This constraint shapes how you structure prompts.
CACHEABLE (same across all requests):
├── System instruction
├── Reference documents
├── Domain knowledge
└── Context documents
NOT CACHEABLE (changes per request):
└── User query
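The split above can be sketched in a few lines of Python. This is an illustration only, not part of any Gemini SDK: `build_prompt` and `prefix_fingerprint` are hypothetical helper names, and the hash is just a local way to check that the prefix really is identical from one request to the next.

```python
import hashlib


def build_prompt(system_instruction: str, documents: list[str], user_query: str) -> tuple[str, str]:
    """Return (cacheable_prefix, per_request_suffix)."""
    # Everything stable goes first; the user query is the only variable part.
    prefix = "\n\n".join([system_instruction, *documents])
    return prefix, user_query


def prefix_fingerprint(prefix: str) -> str:
    """Hash the prefix so two requests can be compared cheaply."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()


docs = ["[DOC: a] ...", "[DOC: b] ..."]
prefix_1, _ = build_prompt("You are an analyst.", docs, "First question?")
prefix_2, _ = build_prompt("You are an analyst.", docs, "Second question?")
# A date embedded in the system instruction silently changes the prefix.
prefix_3, _ = build_prompt("You are an analyst. Today is 2024-05-01.", docs, "Third question?")

same_prefix = prefix_fingerprint(prefix_1) == prefix_fingerprint(prefix_2)
changed_prefix = prefix_fingerprint(prefix_1) != prefix_fingerprint(prefix_3)
```

Logging the fingerprint alongside each request is one inexpensive way to notice when something dynamic has leaked into the part that was supposed to stay fixed.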
Setting Up Context Caching
Context caching in Gemini works via a two-step process: first create a cached content resource, then reference it in subsequent requests.
Step 1: Create the Cached Content
# Create a cached content resource with the shared prefix
# (google-genai Python SDK; assumes an API key is set in the environment)
from google import genai

client = genai.Client()
cached_content = client.caches.create(
    model="gemini-2.5-flash",
    config={
        "system_instruction": system_instruction,
        "contents": [
            {"role": "user", "parts": [{"text": reference_documents}]},
        ],
        "ttl": "3600s",  # 1 hour
    },
)
cache_name = cached_content.name  # e.g., "cachedContents/xyz-123"
The TTL (time-to-live) determines how long the cache persists. Minimum is implementation-specific; typical values range from 300s (5 minutes) to 86400s (24 hours). Shorter TTLs cost less to maintain; longer TTLs amortize better for sustained workloads.
Step 2: Use the Cache in Requests
# Reference the cached prefix when generating
response = client.models.generate_content(
    model="gemini-2.5-flash",
    config={
        "cached_content": cache_name,
    },
    contents=[{"role": "user", "parts": [{"text": user_query}]}],
)
Cache-Optimized Prompt Structure
# CACHEABLE PREFIX (same for all requests)
system_instruction = """
You are a legal document analyst specializing in commercial
lease agreements in California...
"""
reference_documents = """
[DOC: ca-commercial-code-2024]
... full text of relevant statutes ...
[DOC: standard-lease-template]
... full text of the standard lease form ...
[DOC: case-law-summaries]
... summaries of 50 relevant cases ...
"""
cached_prefix = system_instruction + reference_documents
# PER-REQUEST SUFFIX (changes each time)
user_query = "Review this lease for compliance issues: [LEASE TEXT]"
# Send request with cached prefix + new query
Note:
Always place your system instruction inside the cacheable prefix, even for individual requests. This ensures consistent behavior across cached and non-cached requests. If your system prompt changes between requests, you can't use caching.
Maximizing Cache Hits
1. Fixed Structure, Variable Content
The most common caching pattern: keep the structure constant, but swap reference documents.
# Cacheable: system instruction + document schema
cached_prefix = """
System instruction...
Output format specification...
Document analysis framework...
"""
# Per-request: only the document changes
documents = [load_document(doc_id) for doc_id in document_ids]
for doc in documents:
    # gemini.generate is pseudocode for a request that reuses the cached prefix
    response = gemini.generate(cached_prefix + doc + analysis_query)
2. User Session Caching
For multi-turn conversations with a persistent system prompt and reference set:
# Create cache for the session
session_cache = {
    "system_instruction": "...",
    "project_documents": [...],
    "conversation_history": [...],  # Growing but still a prefix
    "ttl": "3600s",
}
# Each user turn appends to conversation history
# Only the latest user message isn't cached
3. Batch Processing with Shared Context
# Cache the shared analysis framework
framework = """
Classification schema: ...
Evaluation criteria: ...
Output template: ...
"""
# Process 1000 documents, each as a separate request
for doc in large_document_set:
    response = gemini.generate(framework + doc + "Classify this document.")
Cache Invalidation
Caches expire based on TTL. But you may want to invalidate earlier:
- Content updates: reference documents changed
- User session ended: free cached resources
- Cost management: cache costs exceed savings for infrequent requests
Invalidate by deleting the cached content resource through the API's cache management endpoints, or by shortening its TTL and letting it expire.
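Deciding when to invalidate can be handled on the client side. The sketch below is one possible approach, not something the Gemini API provides: `CacheEntry` and `needs_new_cache` are hypothetical names, and the expiry time is a local estimate recorded when the cache was created.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheEntry:
    fingerprint: str   # hash of the content the cache was built from
    expires_at: float  # local estimate of expiry, in epoch seconds


def fingerprint(content: str) -> str:
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def needs_new_cache(entry: Optional[CacheEntry], current_content: str,
                    now: Optional[float] = None) -> bool:
    """Return True if a cache must be created or recreated."""
    now = time.time() if now is None else now
    if entry is None:
        return True  # nothing cached yet
    if now >= entry.expires_at:
        return True  # TTL has elapsed
    # Reference documents changed: the old cache no longer matches.
    return entry.fingerprint != fingerprint(current_content)


entry = CacheEntry(fingerprint=fingerprint("statutes v1"), expires_at=1_000.0)
reuse = not needs_new_cache(entry, "statutes v1", now=500.0)
stale_content = needs_new_cache(entry, "statutes v2", now=500.0)
expired = needs_new_cache(entry, "statutes v1", now=1_500.0)
```

When `needs_new_cache` returns True because the content changed, the old cache is a candidate for early deletion so it stops accruing storage charges.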
When NOT to Use Caching
Caching isn't always worth it:
| Scenario | Recommendation |
|---|---|
| One-off queries with unique context | Skip caching — no reuse value |
| Frequently changing system prompts | Can't cache — prefix always changes |
| Very short prompts (< 1K tokens) | Overhead may exceed savings |
| Highly dynamic reference docs | Cache invalidation churn erases benefit |
| Latency-sensitive real-time apps | Cache creation adds initial latency |
Cost Comparison
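For a workflow processing 1,000 documents with a 50K-token shared prefix and roughly 2K variable tokens per request:
| Approach | Input tokens | Cost (relative) |
|---|---|---|
| No caching | 1,000 × (50K + 2K) = 52M at the full rate | 100% |
| With caching | 50K to create the cache + 1,000 × 50K = 50M cached tokens at the discounted rate + 1,000 × 2K = 2M at the full rate | ~28% with a 75% discount, before storage fees |
This is close to the best case: effective savings depend on the prefix-to-suffix ratio, the discount applied to cached tokens, and request frequency.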
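How the relative cost comes out depends on how cached tokens are billed. The following is a rough sketch under stated assumptions, none of which are confirmed by this article: the cache is created with one full-price pass over the prefix, each request then pays for the cached prefix at a discounted rate, and the variable tokens are always billed at the full rate. `relative_input_cost` is a hypothetical helper, and storage fees are ignored here.

```python
def relative_input_cost(n_requests: int, prefix_tokens: int, suffix_tokens: int,
                        cached_discount: float = 0.75) -> float:
    """Cached input cost as a fraction of uncached input cost (storage excluded)."""
    uncached = n_requests * (prefix_tokens + suffix_tokens)
    cached = (
        prefix_tokens                                           # one full-price pass to build the cache
        + n_requests * prefix_tokens * (1 - cached_discount)    # discounted cache reads
        + n_requests * suffix_tokens                            # variable tokens at full rate
    )
    return cached / uncached


ratio_75 = relative_input_cost(1_000, 50_000, 2_000)                          # 75% discount
ratio_free = relative_input_cost(1_000, 50_000, 2_000, cached_discount=1.0)   # cached reads free
```

With the 75% discount from the introduction, the 1,000-document example comes to about 28% of the uncached input cost. Passing `cached_discount=1.0` models the limiting case where cached reads cost nothing, which gives about 4%.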
Note:
Caching has its own cost: you're charged for cache storage based on token count and TTL duration. For infrequent workloads (a few requests per hour with a large prefix), storage costs can exceed the savings from skipped recomputation. Calculate both sides before committing to a caching strategy.
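Both sides of that calculation fit in a few lines. The sketch below is a simplified model, and the prices in the example are illustrative placeholders rather than actual Gemini pricing: it assumes storage is billed per million tokens per hour, savings come only from the discount on cached prefix tokens, and the one-time cost of creating the cache is ignored. The function names are hypothetical.

```python
def caching_net_savings(requests_per_hour: float, prefix_tokens: int, hours: float,
                        input_price_per_m: float, cached_discount: float,
                        storage_price_per_m_hour: float) -> float:
    """Savings from discounted prefix reads minus storage charges (same currency as the prices)."""
    prefix_m = prefix_tokens / 1_000_000
    savings = requests_per_hour * hours * prefix_m * input_price_per_m * cached_discount
    storage = prefix_m * storage_price_per_m_hour * hours
    return savings - storage


def break_even_requests_per_hour(input_price_per_m: float, cached_discount: float,
                                 storage_price_per_m_hour: float) -> float:
    """Request rate at which savings equal storage cost; prefix size cancels out."""
    return storage_price_per_m_hour / (input_price_per_m * cached_discount)


# Placeholder prices: 1.00 per million input tokens, 3.00 per million tokens per hour of storage.
break_even = break_even_requests_per_hour(1.00, 0.75, 3.00)
net_low_traffic = caching_net_savings(2, 50_000, 1, 1.00, 0.75, 3.00)
net_high_traffic = caching_net_savings(10, 50_000, 1, 1.00, 0.75, 3.00)
```

Under these placeholder prices the break-even point is 4 requests per hour: at 2 requests per hour the cache loses money, and at 10 it pays for itself. Substitute current prices from the provider's pricing page before relying on the result.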
Common Failures
| Failure | Cause | Fix |
|---|---|---|
| Low cache hit rate | Prefix changes between requests (timestamps, IDs, counters) | Strip all dynamic content from the cacheable prefix |
| TTL too short | Cache expires before you use it | Set TTL to 2x your expected batch duration |
| TTL too long | Paying for cache storage on idle data | Monitor usage; set TTL to match actual access patterns |
| Wrong prefix boundary | Trying to cache from the middle | Always place cacheable content at the very start |
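Two of the fixes in the table lend themselves to small pre-flight checks. This is a minimal sketch with hypothetical helper names: `find_dynamic_content` flags a couple of common patterns (ISO-style timestamps and UUIDs) and will not catch every kind of dynamic content, while `ttl_for_batch` applies the 2x rule of thumb and returns the TTL in the `"3600s"` string form used earlier.

```python
import re

# Patterns that commonly leak into prompts and change on every request.
DYNAMIC_PATTERNS = {
    "iso_timestamp": re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?"),
    "uuid": re.compile(
        r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
    ),
}


def find_dynamic_content(prefix: str) -> list[str]:
    """Return the names of the patterns found in the would-be cacheable prefix."""
    return [name for name, pattern in DYNAMIC_PATTERNS.items() if pattern.search(prefix)]


def ttl_for_batch(expected_batch_seconds: int, safety_factor: float = 2.0) -> str:
    """TTL string sized to the expected batch duration times a safety factor."""
    return f"{int(expected_batch_seconds * safety_factor)}s"


clean = find_dynamic_content("You are a legal document analyst.")
leaky = find_dynamic_content(
    "Generated at 2024-05-01T12:30:00 for request 123e4567-e89b-12d3-a456-426614174000"
)
ttl = ttl_for_batch(1800)  # a batch expected to take 30 minutes
```

Running a check like this before creating the cache is cheaper than discovering a low hit rate after the batch has run.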
Related Pages
- 1M Token Strategies — Foundation for long-context prompt structure
- Large Document Analysis — Workflows that benefit from caching