The Cost Landscape

API costs vary dramatically by provider, model, and usage pattern. Understanding these differences is the first step to optimization.

Model	Input (per 1M tokens)	Output (per 1M tokens)	Context Window
GPT-4o	$2.50	$10.00	128K
GPT-4o-mini	$0.15	$0.60	128K
Claude 3.5 Sonnet	$3.00	$15.00	200K
Claude 3 Haiku	$0.25	$1.25	200K
Gemini 1.5 Pro	$1.25	$5.00	2M
Gemini 1.5 Flash	$0.075	$0.30	1M
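
To make the table concrete, here is a minimal sketch that turns the listed prices into a per-request cost. The `PRICES` dictionary and the `request_cost` helper are illustrative names, not part of any provider SDK, and the 2,000-input / 500-output request size is an assumed example.

```python
# Prices in USD per 1M tokens, copied from the table above: (input, output).
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o-mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude Haiku": (0.25, 1.25),
    "Gemini 1.5 Pro": (1.25, 5.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the listed prices."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Assumed request size: 2,000 input tokens and 500 output tokens.
    for model in PRICES:
        print(f"{model:20s} ${request_cost(model, 2_000, 500):.6f}")
```

At that request size, the gap between the cheapest and most expensive model in the table is well over an order of magnitude, which is what makes routing worthwhile.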

Model Routing: Use the Right Model

Not every task needs GPT-4o or Claude Sonnet. Route simple tasks to smaller models.

def route_task(query, complexity):
    """Route to an appropriate model based on task complexity."""
    routes = {
        "trivial": "gpt-4o-mini",             # $0.15/$0.60 per 1M
        "classification": "claude-haiku",     # $0.25/$1.25 per 1M
        "summarization": "gemini-1.5-flash",  # $0.075/$0.30 per 1M
        "reasoning": "claude-3.5-sonnet",     # $3.00/$15.00 per 1M
        "creative": "gpt-4o",                 # $2.50/$10.00 per 1M
    }

    if complexity not in routes:
        # Unknown label: use a small model to classify the query, once.
        # classify_with is a helper (not shown) that returns a dict
        # with a "complexity" key.
        complexity = classify_with("gpt-4o-mini", query).get("complexity")

    # If the classifier also returns an unrecognized label, fall back to a
    # capable model instead of recursing forever.
    return routes.get(complexity, "gpt-4o")

Two-tier routing pattern:

Step 1: Small model classifies the task → "this is a simple classification"
Step 2: Router dispatches to appropriate model based on classification
Step 3: Large model only handles complex cases

Cost savings: roughly 60-80% versus sending everything to GPT-4o, depending on how much of the traffic the small models can handle.
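
The savings range depends mostly on what share of requests the small model can absorb. The sketch below makes that relationship explicit. The function names are hypothetical, the per-request costs assume 500 input and 100 output tokens at the GPT-4o and GPT-4o-mini list prices, and the cost of the classification call itself is ignored.

```python
def routing_savings(share_to_small: float, small_cost: float, large_cost: float) -> float:
    """Fractional saving versus sending every request to the large model."""
    blended = share_to_small * small_cost + (1 - share_to_small) * large_cost
    return 1 - blended / large_cost


def share_needed(target_saving: float, small_cost: float, large_cost: float) -> float:
    """Share of traffic that must go to the small model to hit a target saving."""
    return target_saving / (1 - small_cost / large_cost)


if __name__ == "__main__":
    # Assumed per-request costs for 500 input + 100 output tokens.
    large = (500 * 2.50 + 100 * 10.00) / 1_000_000  # GPT-4o
    small = (500 * 0.15 + 100 * 0.60) / 1_000_000   # GPT-4o-mini

    print(f"Saving at 70% routed: {routing_savings(0.70, small, large):.1%}")
    print(f"Share needed for 60%: {share_needed(0.60, small, large):.1%}")
    print(f"Share needed for 80%: {share_needed(0.80, small, large):.1%}")
```

Under these assumptions, reaching the low end of the range takes about two thirds of traffic on the small model, and the high end takes roughly 85%.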

Prompt Caching

All three major providers support caching. Put static content first (system prompt, few-shot examples) and dynamic content last (user query).

Provider	Mechanism	Discount	Cache Lifetime
Anthropic	Prompt Caching API	90% on cache hits (cache writes cost 25% more)	5 min idle, up to 1hr
Google	Context Caching	75% on cache hits	Configurable TTL
OpenAI	Automatic	50% on cached input tokens (GPT-4o family)	Automatic

Cache-optimized prompt structure:

[Start of cached prefix]
System prompt (cached: never changes)
Few-shot examples (cached: static)
Tool definitions (cached: static)
[Cache breakpoint: end of cached prefix]

User query (dynamic: varies per request)

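# Anthropic: prompt caching with cache_control
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
user_query = "How do I reset my password?"  # placeholder for the dynamic part

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # required by the Messages API
    system=[
        {
            "type": "text",
            # In practice this must be long: prefixes below the minimum
            # cacheable length (1,024 tokens for Sonnet) are not cached.
            "text": "You are a helpful assistant...",
            "cache_control": {"type": "ephemeral"}  # Cache the prefix up to here
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
# First call writes the cache (billed at 1.25x the base input rate).
# Later calls that reuse the same prefix within the cache lifetime: 90% off.
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)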
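
Whether caching pays off depends on how often the prefix is reused before it expires. The sketch below estimates the input cost of a shared prefix with and without caching. The function names are illustrative; the default multipliers (1.25x for a cache write, 0.10x for a cache read) reflect Anthropic's pricing for the 5-minute cache as I understand it and should be checked against the current price list. It also assumes every call after the first is a cache hit.

```python
def uncached_input_cost(prefix_tokens: int, calls: int, base_rate: float) -> float:
    """USD cost of sending the same prefix `calls` times at the base input rate."""
    return prefix_tokens * base_rate / 1_000_000 * calls


def cached_input_cost(
    prefix_tokens: int,
    calls: int,
    base_rate: float,
    write_mult: float = 1.25,  # assumed surcharge for the call that writes the cache
    read_mult: float = 0.10,   # assumed price of a cache hit relative to base
) -> float:
    """USD cost of the prefix when the first call writes the cache and the rest hit it."""
    per_token = base_rate / 1_000_000
    write_cost = prefix_tokens * per_token * write_mult
    read_cost = prefix_tokens * per_token * read_mult * (calls - 1)
    return write_cost + read_cost


if __name__ == "__main__":
    # Assumed workload: a 2,000-token prefix reused across 100 calls at $3.00 per 1M.
    plain = uncached_input_cost(2_000, 100, 3.00)
    cached = cached_input_cost(2_000, 100, 3.00)
    print(f"uncached ${plain:.4f}, cached ${cached:.4f}, saving {1 - cached / plain:.1%}")
```

With these multipliers, a single call is more expensive with caching than without, and caching starts to win from the second call that hits the cache.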

Batch Processing

OpenAI's Batch API offers a 50% discount in exchange for asynchronous processing, with results returned within a 24-hour window. It suits offline evaluation, data processing, and other non-realtime tasks.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
queries = ["Summarize document A", "Summarize document B"]  # placeholder inputs

# Prepare one request per query; custom_id lets you match results later
batch_requests = []
for i, query in enumerate(queries):
    batch_requests.append({
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": query}]
        }
    })

# Write the requests as JSONL (one JSON object per line)
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload the batch file
with open("batch_input.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Submit batch job (50% discount, results within 24 hours)
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# Check status; poll until it reaches "completed", then fetch the output file
batch = client.batches.retrieve(batch.id)
print(f"Batch {batch.id}: {batch.status}")
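
Once the batch reports `completed`, its results arrive as a JSONL file whose lines are not guaranteed to be in submission order, so they need to be matched back by `custom_id`. The parser below is a minimal sketch of that step. `parse_batch_output` is a hypothetical helper, and the record layout it expects (`custom_id`, `response.status_code`, `response.body`, `error`) is my understanding of the Batch API output format; verify it against the current documentation. The text itself would typically be downloaded first, for example with `client.files.content(batch.output_file_id).text`.

```python
import json
from typing import Dict, Optional


def parse_batch_output(jsonl_text: str) -> Dict[str, Optional[str]]:
    """Map each custom_id to the assistant's reply, or None if that request failed."""
    results: Dict[str, Optional[str]] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


if __name__ == "__main__":
    # Hand-written sample in the assumed output layout, deliberately out of order.
    ok = {"status_code": 200,
          "body": {"choices": [{"message": {"role": "assistant", "content": "Summary B"}}]}}
    sample = "\n".join([
        json.dumps({"custom_id": "request-1", "response": ok, "error": None}),
        json.dumps({"custom_id": "request-0", "response": None,
                    "error": {"code": "server_error", "message": "failed"}}),
    ])
    print(parse_batch_output(sample))
```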

Output Token Optimization

Output tokens cost 4-5x as much as input tokens for every model in the pricing table, so trimming output is disproportionately effective.

Strategy	Typical Savings	Impact on Quality
Set `max_tokens`	Prevents runaway outputs	None if set appropriately
Ask for concise responses	30-50% fewer tokens	Minimal for factual tasks
Use `stop` sequences	Stops at exact boundary	None
Request structured output	Reduces verbosity	Improves consistency
Streaming	No cost savings (same tokens)	Improves perceived latency
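
To see how much a shorter response is worth, the sketch below splits a request's cost into its input and output parts and estimates the saving from trimming output. The helper names are illustrative, and the request size (500 input, 100 output tokens at GPT-4o list prices) and the 40% reduction are assumptions chosen from the middle of the range in the table.

```python
from typing import Tuple


def cost_split(input_tokens: int, output_tokens: int,
               input_rate: float, output_rate: float) -> Tuple[float, float]:
    """Return (input_cost, output_cost) in USD for one request."""
    input_cost = input_tokens * input_rate / 1_000_000
    output_cost = output_tokens * output_rate / 1_000_000
    return input_cost, output_cost


def saving_from_shorter_output(input_tokens: int, output_tokens: int,
                               input_rate: float, output_rate: float,
                               reduction: float) -> float:
    """Fraction of the total request cost saved by cutting output tokens by `reduction`."""
    input_cost, output_cost = cost_split(input_tokens, output_tokens, input_rate, output_rate)
    return output_cost * reduction / (input_cost + output_cost)


if __name__ == "__main__":
    # Assumed request: 500 input + 100 output tokens at GPT-4o prices.
    input_cost, output_cost = cost_split(500, 100, 2.50, 10.00)
    share = output_cost / (input_cost + output_cost)
    print(f"Output is {100 / 600:.0%} of tokens but {share:.0%} of cost")
    print(f"40% fewer output tokens saves "
          f"{saving_from_shorter_output(500, 100, 2.50, 10.00, 0.40):.1%} of the bill")
```

In this example, output makes up about a sixth of the tokens but a little under half of the cost, which is why a modest cut in verbosity shows up clearly on the bill.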

# Before: verbose
"Please write a detailed explanation of..."

# After: constrained
"Explain in 2-3 sentences:"

Cost Tracking

Log every API call to detect cost anomalies.

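import time
from dataclasses import dataclass

@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    timestamp: float

class CostTracker:
    def __init__(self):
        self.records = []

    def track(self, model, response):
        # Anthropic reports input_tokens/output_tokens; OpenAI chat completions
        # report prompt_tokens/completion_tokens. Accept either.
        u = response.usage
        input_tokens = getattr(u, "input_tokens", None)
        if input_tokens is None:
            input_tokens = u.prompt_tokens
        output_tokens = getattr(u, "output_tokens", None)
        if output_tokens is None:
            output_tokens = u.completion_tokens

        usage = Usage(
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost=self.calculate_cost(model, input_tokens, output_tokens),
            timestamp=time.time()
        )
        previous = self.records[-10:]  # baseline excludes the current call
        self.records.append(usage)

        # Alert on spike: this call costs over 3x the average of the previous 10
        if len(previous) == 10:
            recent_avg = sum(r.cost for r in previous) / 10
            if usage.cost > recent_avg * 3:
                print(f"Cost spike: ${usage.cost:.4f} vs avg ${recent_avg:.4f}")

    def calculate_cost(self, model, input_tokens, output_tokens):
        # USD per 1M tokens: (input, output)
        rates = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
            "claude-3.5-sonnet": (3.00, 15.00),
            "claude-haiku": (0.25, 1.25),
        }
        # Unknown models are costed at $0, so add a rate for every model you call
        input_rate, output_rate = rates.get(model, (0, 0))
        input_cost = (input_tokens / 1_000_000) * input_rate
        output_cost = (output_tokens / 1_000_000) * output_rate
        return input_cost + output_cost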
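
Per-call alerts catch spikes, but a per-model total is often the quickest way to see where the money goes. The helper below is a small sketch of that aggregation. `cost_by_model` is a hypothetical name, and it takes plain `(model, cost)` pairs so it runs on its own; with the tracker above you could pass `((r.model, r.cost) for r in tracker.records)`.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def cost_by_model(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum cost per model and return the totals, most expensive model first."""
    totals: Dict[str, float] = defaultdict(float)
    for model, cost in records:
        totals[model] += cost
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    # Made-up records for illustration only.
    sample = [("gpt-4o", 0.0100), ("gpt-4o-mini", 0.0010), ("gpt-4o", 0.0200)]
    for model, total in cost_by_model(sample).items():
        print(f"{model:15s} ${total:.4f}")
```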

Real-World Savings

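Consider a chatbot handling 1,000 queries/day, averaging 500 input and 100 output tokens per query. The GPT-4o baseline follows directly from the list prices (0.5M input tokens × $2.50 + 0.1M output tokens × $10.00 = $2.25/day); the optimized figures are rough estimates that depend on traffic mix, cache hit rate, and how much work can be deferred to batch jobs: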

Strategy	Daily Cost (GPT-4o)	Daily Cost (Optimized)	Savings
All queries to GPT-4o	$2.25	—	—
Route 70% to GPT-4o-mini	—	$0.95	58%
+ Prompt caching	—	$0.50	78%
+ Constrained output tokens	—	$0.38	83%
+ Batch processing for eval	—	$0.25	89%

Each layer of optimization compounds. Start with model routing (biggest win), add caching, then constrain outputs.
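
The first two rows of the table can be checked with a few lines of arithmetic. The sketch below reproduces the baseline and estimates the routed cost, with and without the overhead of a classification call. The `daily_cost` helper is illustrative, and the classifier overhead (one GPT-4o-mini call per query reading the full 500-token input and emitting about 5 tokens) is an assumption of this sketch, not a figure from the table.

```python
from typing import Tuple

GPT_4O = (2.50, 10.00)      # USD per 1M tokens: (input, output)
GPT_4O_MINI = (0.15, 0.60)


def daily_cost(queries: int, input_tokens: int, output_tokens: int,
               rates: Tuple[float, float]) -> float:
    """USD per day for `queries` requests of the given size at the given rates."""
    input_rate, output_rate = rates
    per_query = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return queries * per_query


if __name__ == "__main__":
    baseline = daily_cost(1_000, 500, 100, GPT_4O)

    # 70% of queries go to GPT-4o-mini, 30% stay on GPT-4o.
    routed = daily_cost(300, 500, 100, GPT_4O) + daily_cost(700, 500, 100, GPT_4O_MINI)

    # Assumed overhead: one small-model classification call per query.
    classifier = daily_cost(1_000, 500, 5, GPT_4O_MINI)

    print(f"baseline:            ${baseline:.2f}")
    print(f"routed:              ${routed:.2f} ({1 - routed / baseline:.0%} saved)")
    print(f"routed + classifier: ${routed + classifier:.2f} "
          f"({1 - (routed + classifier) / baseline:.0%} saved)")
```

Under these assumptions the routed cost comes out somewhat below the table's $0.95, even with the classifier included, so the table's optimized rows are best read as rough estimates.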

