API Cost Optimization: Cut LLM Expenses by 80%
Token counting, model routing, batch processing, and caching across OpenAI, Anthropic, and Google. Practical strategies to optimize API costs without sacrificing output quality.
The Cost Landscape
API costs vary dramatically by provider, model, and usage pattern. Understanding these differences is the first step to optimization.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o-mini | $0.15 | $0.60 | 128K |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Claude 3 Haiku | $0.25 | $1.25 | 200K |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
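To make the table concrete, here is a minimal sketch that turns those list prices into a per-request cost. The `PRICES` dictionary and the `request_cost` helper are illustrative names, not part of any provider SDK, and the model keys are shorthand labels rather than exact API model IDs.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-haiku": (0.25, 1.25),
    "gemini-1.5-pro": (1.25, 5.00),
    "gemini-1.5-flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the table's list prices."""
    input_rate, output_rate = PRICES[model]  # raises KeyError for unknown models
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Compare one 500-in / 100-out request across every model in the table.
    for name in PRICES:
        print(f"{name:>18}: ${request_cost(name, 500, 100):.6f}")
```

Raising on an unknown model is a deliberate choice here: a silent $0 would hide spend rather than surface it.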
Model Routing: Use the Right Model
Not every task needs GPT-4o or Claude Sonnet. Route simple tasks to smaller models.
def route_task(query, complexity=None):
    """Route to an appropriate model based on task complexity."""
    # Model names are shorthand labels; map them to your provider's exact model IDs.
    routes = {
        "trivial": "gpt-4o-mini",             # $0.15/$0.60 per 1M
        "classification": "claude-haiku",     # $0.25/$1.25 per 1M
        "summarization": "gemini-1.5-flash",  # $0.075/$0.30 per 1M
        "reasoning": "claude-3.5-sonnet",     # $3.00/$15.00 per 1M
        "creative": "gpt-4o",                 # $2.50/$10.00 per 1M
    }
    if complexity in routes:
        return routes[complexity]
    # Default: use a small model to classify, then route.
    # classify_with is a helper you supply; it returns {"complexity": "<label>"}.
    classifier = classify_with("gpt-4o-mini", query)
    # Fall back to the large model on an unknown label instead of recursing forever.
    return routes.get(classifier["complexity"], "gpt-4o")
Two-tier routing pattern:
Step 1: Small model classifies the task → "this is a simple classification"
Step 2: Router dispatches to appropriate model based on classification
Step 3: Large model only handles complex cases
Cost savings: roughly 60-80% vs sending everything to GPT-4o, depending on how much of your traffic the small models can absorb.
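The 60-80% range can be sanity-checked with a little arithmetic. The sketch below assumes a workload of 1,000 requests per day at 500 input and 100 output tokens, a routed share that goes entirely to GPT-4o-mini, and a classifier call that costs nothing (in practice it adds a small overhead). `daily_cost` and `routing_savings` are hypothetical helper names.

```python
def daily_cost(queries, input_tokens, output_tokens, input_rate, output_rate):
    """USD per day for `queries` requests at per-1M-token rates."""
    per_request = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return queries * per_request


def routing_savings(small_fraction, queries=1000, input_tokens=500, output_tokens=100):
    """Fractional savings vs sending every request to GPT-4o.

    Assumption: the routed share goes to GPT-4o-mini, the remainder stays on
    GPT-4o, and classifier overhead is ignored.
    """
    baseline = daily_cost(queries, input_tokens, output_tokens, 2.50, 10.00)
    large = daily_cost(queries * (1 - small_fraction), input_tokens, output_tokens, 2.50, 10.00)
    small = daily_cost(queries * small_fraction, input_tokens, output_tokens, 0.15, 0.60)
    return 1 - (large + small) / baseline


if __name__ == "__main__":
    for share in (0.5, 0.7, 0.85):
        print(f"route {share:.0%} to the small model -> save {routing_savings(share):.1%}")
```

Under these assumptions each routed request costs 6% of a GPT-4o request, so savings are about 0.94 times the routed share. Landing in the 60-80% band means routing roughly 64-85% of traffic to the small model.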
Prompt Caching
All three major providers support caching. Put static content first (system prompt, few-shot examples) and dynamic content last (user query).
| Provider | Mechanism | Discount | Cache Lifetime |
|---|---|---|---|
| Anthropic | Prompt Caching API | 90% on cache hits | 5 min idle, up to 1hr |
| Google | Context Caching | 75% on cache hits | Configurable TTL |
| OpenAI | Automatic prompt caching | 50% on cached input tokens for GPT-4o-class models (prompts of 1,024+ tokens), plus lower latency | Automatic |
Cache-optimized prompt structure:
[Cache breakpoint]
System prompt (cached — never changes)
Few-shot examples (cached — static)
Tool definitions (cached — static)
[End cached prefix]
User query (dynamic — varies per request)
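# Anthropic: prompt caching with cache_control
import anthropic

client = anthropic.Anthropic()
user_query = "..."  # the per-request text; varies on every call

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant...",
            "cache_control": {"type": "ephemeral"}  # cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
# First call writes the cache (billed slightly above the normal input rate).
# Later calls with an identical prefix read it at about 90% off while the cache is live.
# Very short prompts fall below the minimum cacheable length and are not cached.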
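Whether caching pays off depends on how large the static prefix is and how often it is reused before it expires. Here is a rough estimate of the input-side cost, assuming Claude 3.5 Sonnet's $3.00 input rate, the 90% read discount from the table, and a 25% premium on the initial cache write. That write premium is an assumption about Anthropic's pricing that the table does not list. The function names are illustrative.

```python
def cached_input_cost(prefix_tokens, dynamic_tokens, calls, base_rate=3.00,
                      write_multiplier=1.25, read_multiplier=0.10):
    """USD for the input side of `calls` requests that share one cached prefix.

    Assumption: one cache write on the first call and cache hits on every
    later call, i.e. each call arrives before the cache entry expires.
    """
    if calls < 1:
        return 0.0
    write = prefix_tokens * base_rate * write_multiplier
    reads = prefix_tokens * base_rate * read_multiplier * (calls - 1)
    dynamic = dynamic_tokens * base_rate * calls
    return (write + reads + dynamic) / 1_000_000


def uncached_input_cost(prefix_tokens, dynamic_tokens, calls, base_rate=3.00):
    """USD for the same traffic with no caching at all."""
    return (prefix_tokens + dynamic_tokens) * base_rate * calls / 1_000_000


if __name__ == "__main__":
    cached = cached_input_cost(2000, 100, calls=100)
    plain = uncached_input_cost(2000, 100, calls=100)
    print(f"cached ${cached:.4f} vs uncached ${plain:.4f} ({1 - cached / plain:.0%} saved)")
```

With a single call the cached path is slightly more expensive because of the write premium, so caching only helps when the prefix is actually reused.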
Batch Processing
OpenAI offers a Batch API with a 50% discount in exchange for asynchronous completion within 24 hours. It suits offline evaluation, data processing, and other non-realtime tasks.
import json

import openai

# queries: the list of prompt strings to process (defined elsewhere)

# Prepare batch requests, one per query
batch_requests = []
for i, query in enumerate(queries):
    batch_requests.append({
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": query}]
        }
    })

# Write the requests as JSONL, one request per line
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload batch file
with open("batch_input.jsonl", "rb") as f:
    batch_file = openai.files.create(file=f, purpose="batch")

# Submit batch job (50% discount, results within 24 hours)
batch = openai.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# Check status; results are available once the status is "completed"
print(f"Batch {batch.id}: {openai.batches.retrieve(batch.id).status}")
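Once the batch completes, the results come back as a JSONL file with one line per request, keyed by the `custom_id` you supplied. The parser below is a sketch that assumes each line carries `custom_id`, `response` (with `status_code` and `body`), and `error` fields, which is the output layout as I understand it. `parse_batch_output` is a hypothetical helper name.

```python
import json


def parse_batch_output(jsonl_text):
    """Map custom_id -> assistant text, or None for failed requests."""
    results = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        choices = response["body"]["choices"]
        results[record["custom_id"]] = choices[0]["message"]["content"]
    return results
```

The text to feed it would typically come from downloading the batch's output file, for example `openai.files.content(batch.output_file_id).text` after the status reaches "completed". Results are not guaranteed to come back in submission order, which is why the mapping is keyed by `custom_id` rather than position.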
Output Token Optimization
In the pricing table above, every output token costs 4-5x as much as an input token, so trimming output is the cheapest lever after routing.
| Strategy | Typical Savings | Impact on Quality |
|---|---|---|
| Set max_tokens | Prevents runaway outputs | None if set appropriately |
| Ask for concise responses | 30-50% fewer tokens | Minimal for factual tasks |
| Use stop sequences | Stops at exact boundary | None |
| Request structured output | Reduces verbosity | Improves consistency |
| Streaming | No cost savings (same tokens) | Improves perceived latency |
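The first three rows of the table map directly onto request parameters. This sketch builds the keyword arguments for an OpenAI chat completion with a token cap, a brevity instruction, and optional stop sequences. `constrained_request` is an illustrative helper, and the default limit and model are assumptions you would tune for your task.

```python
def constrained_request(query, max_tokens=150, stop=None):
    """Build keyword arguments for a chat completion with bounded output."""
    kwargs = {
        "model": "gpt-4o-mini",
        "messages": [
            # Asking for brevity reduces typical output length...
            {"role": "system", "content": "Answer in 2-3 sentences."},
            {"role": "user", "content": query},
        ],
        # ...while max_tokens puts a hard ceiling on the worst case.
        "max_tokens": max_tokens,
    }
    if stop:
        kwargs["stop"] = list(stop)  # generation halts at the first match
    return kwargs


if __name__ == "__main__":
    print(constrained_request("What is prompt caching?", stop=["\n\n"]))
```

The result can be passed as `openai.chat.completions.create(**constrained_request(...))`. Note that a cap set too low truncates answers mid-sentence, so it works best as a safety net alongside the brevity instruction rather than as the only control.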
# Before: verbose
"Please write a detailed explanation of..."
# After: constrained
"Explain in 2-3 sentences:"
Cost Tracking
Log every API call to detect cost anomalies.
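import time
from dataclasses import dataclass


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    timestamp: float


class CostTracker:
    def __init__(self):
        self.records = []

    def track(self, model, response):
        # Anthropic reports input_tokens/output_tokens; OpenAI chat
        # completions report prompt_tokens/completion_tokens.
        raw = response.usage
        input_tokens = getattr(raw, "input_tokens", None) or getattr(raw, "prompt_tokens", 0)
        output_tokens = getattr(raw, "output_tokens", None) or getattr(raw, "completion_tokens", 0)
        usage = Usage(
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost=self.calculate_cost(model, input_tokens, output_tokens),
            timestamp=time.time()
        )
        # Alert on spike: compare against the 10 calls before this one
        previous = self.records[-10:]
        self.records.append(usage)
        if len(previous) == 10:
            recent_avg = sum(r.cost for r in previous) / 10
            if usage.cost > recent_avg * 3:
                print(f"Cost spike: ${usage.cost:.4f} vs avg ${recent_avg:.4f}")

    def calculate_cost(self, model, input_tokens, output_tokens):
        # USD per 1M tokens: (input, output)
        rates = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
            "claude-3.5-sonnet": (3.00, 15.00),
            "claude-haiku": (0.25, 1.25),
            "gemini-1.5-pro": (1.25, 5.00),
            "gemini-1.5-flash": (0.075, 0.30),
        }
        # Unknown models are priced at $0; add their rates here so they are not undercounted
        input_rate, output_rate = rates.get(model, (0, 0))
        input_cost = (input_tokens / 1_000_000) * input_rate
        output_cost = (output_tokens / 1_000_000) * output_rate
        return input_cost + output_cost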
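Per-call alerts catch spikes, but slow drift is easier to spot in a daily rollup. Here is a small sketch that sums logged costs per day and model. It assumes each log entry has been reduced to a `(model, cost, timestamp)` tuple with a Unix timestamp, and `daily_totals` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime, timezone


def daily_totals(records):
    """Sum cost per (UTC date, model) from (model, cost, timestamp) tuples."""
    totals = defaultdict(float)
    for model, cost, timestamp in records:
        day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
        totals[(day, model)] += cost
    return dict(totals)


if __name__ == "__main__":
    log = [("gpt-4o", 0.01, 0), ("gpt-4o", 0.02, 3600), ("gpt-4o-mini", 0.001, 86400)]
    for (day, model), total in sorted(daily_totals(log).items()):
        print(f"{day} {model}: ${total:.4f}")
```

Grouping in UTC keeps day boundaries stable regardless of where the service runs. Switch to a local timezone if your billing reports use one.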
Real-World Savings
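A typical chatbot handling 1,000 queries/day, average 500 input + 100 output tokens. At GPT-4o rates that is 0.5M input tokens × $2.50 plus 0.1M output tokens × $10.00, or $2.25/day. The optimized figures are estimates and will shift with your traffic mix and cache hit rate: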
| Strategy | Daily Cost (GPT-4o) | Daily Cost (Optimized) | Savings |
|---|---|---|---|
| All queries to GPT-4o | $2.25 | — | — |
| Route 70% to GPT-4o-mini | — | $0.95 | 58% |
| + Prompt caching | — | $0.50 | 78% |
| + Constrained output tokens | — | $0.38 | 83% |
| + Batch processing for eval | — | $0.25 | 89% |
Each layer of optimization compounds. Start with model routing (biggest win), add caching, then constrain outputs.