API Cost Optimization: Cut LLM Expenses by 80%

Token counting, model routing, batch processing, and caching across OpenAI, Anthropic, and Google. Practical strategies to optimize API costs without sacrificing output quality.

June 10, 2026
api-cost, token-optimization, caching, batch-processing, prompt-engineering

The Cost Landscape

API costs vary dramatically by provider, model, and usage pattern. Understanding these differences is the first step to optimization.

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Context Window |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | 128K |
| GPT-4o-mini | $0.15 | $0.60 | 128K |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 200K |
| Claude Haiku | $0.25 | $1.25 | 200K |
| Gemini 1.5 Pro | $1.25 | $5.00 | 2M |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1M |
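
To make the table concrete, here is a minimal sketch that turns the listed rates into a per-request cost. The `PRICES` dictionary and the `request_cost` helper are illustrative names, not part of any provider SDK, and the rates are copied from the table above, so they should be checked against current pricing pages before use.

```python
# Prices in USD per 1M tokens, copied from the table above: (input, output).
PRICES = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
    "claude-haiku": (0.25, 1.25),
    "gemini-1.5-pro": (1.25, 5.00),
    "gemini-1.5-flash": (0.075, 0.30),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-1M-token rates."""
    input_rate, output_rate = PRICES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Compare the same 500-in / 100-out request across every model in the table.
    for model in PRICES:
        print(f"{model:>18}: ${request_cost(model, 500, 100):.6f} per request")
```

Running the same request shape through every model shows how widely the per-request price varies for identical token counts.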

Model Routing: Use the Right Model

Not every task needs GPT-4o or Claude Sonnet. Route simple tasks to smaller models.

def route_task(query, complexity):
    """Route to appropriate model based on task complexity."""
    if complexity == "trivial":
        return "gpt-4o-mini"  # $0.15/$0.60 per 1M

    if complexity == "classification":
        return "claude-haiku"  # $0.25/$1.25 per 1M

    if complexity == "summarization":
        return "gemini-1.5-flash"  # $0.075/$0.30 per 1M

    if complexity == "reasoning":
        return "claude-3.5-sonnet"  # $3.00/$15.00 per 1M

    if complexity == "creative":
        return "gpt-4o"  # $2.50/$10.00 per 1M

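    # Default: use small model to classify, then route.
    # classify_with is a placeholder for your own classifier call; it is
    # expected to return a dict such as {"complexity": "reasoning"}.
    known = {"trivial", "classification", "summarization", "reasoning", "creative"}
    label = classify_with("gpt-4o-mini", query)["complexity"]
    if label not in known:
        # Unrecognized label: fall back to a capable model instead of recursing forever
        return "claude-3.5-sonnet"
    return route_task(query, label)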

Two-tier routing pattern:

Step 1: Small model classifies the task → "this is a simple classification"
Step 2: Router dispatches to appropriate model based on classification
Step 3: Large model only handles complex cases

Cost savings: 60-80% vs. sending everything to GPT-4o, depending on how much of the traffic the smaller models can absorb.

Prompt Caching

All three major providers support caching. Put static content first (system prompt, few-shot examples) and dynamic content last (user query).

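| Provider | Mechanism | Discount | Cache Lifetime |
|---|---|---|---|
| Anthropic | Prompt Caching API | 90% on cache hits (cache writes cost 25% extra) | 5 min idle, up to 1 hr |
| Google | Context Caching | 75% on cache hits | Configurable TTL |
| OpenAI | Automatic (prompts of 1,024+ tokens) | 50% on cached input tokens, plus lower latency | 5-10 min idle, up to 1 hr |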

Cache-optimized prompt structure:

System prompt (cached, never changes)
Few-shot examples (cached, static)
Tool definitions (cached, static)
[Cache breakpoint: marks the end of the cached prefix]

User query (dynamic, varies per request)

# Anthropic: prompt caching with cache_control
import anthropic
client = anthropic.Anthropic()

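user_query = "What is your refund policy?"  # Example input; replace with the real query

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,  # Required by the Messages API
    system=[
        {
            "type": "text",
            # A real prompt must meet the minimum cacheable length
            # (1,024 tokens for Sonnet models) or it is not cached.
            "text": "You are a helpful assistant...",
            "cache_control": {"type": "ephemeral"}  # Cache everything up to here
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
# First call writes the cache (25% premium on the cached tokens).
# Later calls with the same prefix, within the cache lifetime, read it at 90% off.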
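
Whether caching pays off depends on how often the prefix is reused. The sketch below compares the cost of resending a static prefix with and without caching. It assumes Anthropic-style multipliers (cache writes at 1.25x the base input rate, cache reads at 0.10x) and that every call after the first lands within the cache lifetime. The function name and parameters are illustrative, and the multipliers should be checked against the provider's current pricing.

```python
def prefix_cost(calls, prefix_tokens, input_rate,
                write_multiplier=1.25, read_multiplier=0.10):
    """Return (uncached, cached) USD cost of sending the same prefix `calls` times.

    input_rate is USD per 1M input tokens. The multipliers are assumptions
    modeled on Anthropic-style pricing: the first call writes the cache,
    every later call reads it.
    """
    if calls == 0:
        return 0.0, 0.0
    base = prefix_tokens / 1_000_000 * input_rate  # Cost of one uncached send
    uncached = calls * base
    cached = base * write_multiplier + (calls - 1) * base * read_multiplier
    return uncached, cached


if __name__ == "__main__":
    # A 2,000-token system prompt at $3.00 per 1M input tokens.
    for calls in (1, 2, 10, 100):
        uncached, cached = prefix_cost(calls, prefix_tokens=2_000, input_rate=3.00)
        print(f"{calls:>3} calls: uncached ${uncached:.4f}, cached ${cached:.4f}")
```

Under these assumptions a prefix that is sent only once costs more with caching than without, and the cache pays for itself from the second call onward.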

Batch Processing

OpenAI's Batch API offers a 50% discount in exchange for asynchronous processing, with results returned within 24 hours. It suits offline evaluation, data processing, and other non-realtime tasks.

import openai
import json

# Prepare batch file
batch_requests = []
for i, query in enumerate(queries):
    batch_requests.append({
        "custom_id": f"request-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",
            "messages": [{"role": "user", "content": query}]
        }
    })

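# Write the requests as JSONL, one request per line
with open("batch_input.jsonl", "w") as f:
    for req in batch_requests:
        f.write(json.dumps(req) + "\n")

# Upload the batch file (the context manager closes the handle afterwards)
with open("batch_input.jsonl", "rb") as f:
    batch_file = openai.files.create(file=f, purpose="batch")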

# Submit batch job (50% discount, 24hr turnaround)
batch = openai.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h"
)

# Check status; once it reaches "completed", the results are in the file
# referenced by the batch's output_file_id
status = openai.batches.retrieve(batch.id).status
print(f"Batch {batch.id}: {status}")
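
Once a batch completes, the results arrive as a JSONL file in which each line carries the `custom_id` from the request. The sketch below matches replies back to their requests. It assumes each output line has `custom_id`, `response` (with `status_code` and a chat-completion `body`), and `error` fields, which is my understanding of the Batch API output format and worth verifying against the official docs. The helper name and the sample lines are made up for illustration.

```python
import json
from typing import Dict, Optional


def parse_batch_output(jsonl_text: str) -> Dict[str, Optional[str]]:
    """Map each custom_id to the assistant's reply text, or None if the request failed."""
    results: Dict[str, Optional[str]] = {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # Skip blank lines
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            results[record["custom_id"]] = None
            continue
        body = response["body"]
        results[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return results


# Illustrative sample lines (invented), shaped like the assumed output format.
sample = "\n".join([
    json.dumps({
        "custom_id": "request-0",
        "response": {
            "status_code": 200,
            "body": {"choices": [{"message": {"role": "assistant", "content": "Hello"}}]},
        },
        "error": None,
    }),
    json.dumps({
        "custom_id": "request-1",
        "response": None,
        "error": {"code": "server_error", "message": "failed"},
    }),
])

if __name__ == "__main__":
    print(parse_batch_output(sample))
```

In a real run the JSONL text would come from downloading the batch's output file rather than from a hard-coded sample. Keying on `custom_id` matters because batch results are not guaranteed to come back in request order.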

Output Token Optimization

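At the rates in the pricing table above, every output token costs 4-5x as much as an input token, so trimming output is the most direct way to cut the bill.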
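
A short worked example shows why output tokens deserve attention even when they are a small share of total tokens. The two helper functions are illustrative names. The rates are the GPT-4o figures from the pricing table, and the 500-input / 100-output request shape is the one used in the savings estimate at the end of this article.

```python
def output_share(input_tokens, output_tokens, input_rate, output_rate):
    """Fraction of a request's cost that comes from output tokens."""
    input_cost = input_tokens * input_rate
    output_cost = output_tokens * output_rate
    return output_cost / (input_cost + output_cost)


def savings_from_shorter_output(input_tokens, output_before, output_after,
                                input_rate, output_rate):
    """Fractional cost reduction from cutting output length, input unchanged."""
    before = input_tokens * input_rate + output_before * output_rate
    after = input_tokens * input_rate + output_after * output_rate
    return 1 - after / before


if __name__ == "__main__":
    # GPT-4o rates from the pricing table: $2.50 input, $10.00 output per 1M tokens.
    share = output_share(500, 100, 2.50, 10.00)
    saved = savings_from_shorter_output(500, 100, 60, 2.50, 10.00)
    print(f"Output share of cost: {share:.0%}")
    print(f"Savings from 40% shorter output: {saved:.0%}")
```

For this request shape, output tokens are about a sixth of the tokens but roughly 44% of the cost, and trimming the reply from 100 to 60 tokens lowers the request cost by about 18%.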

| Strategy | Typical Savings | Impact on Quality |
|---|---|---|
| Set max_tokens | Prevents runaway outputs | None if set appropriately |
| Ask for concise responses | 30-50% fewer tokens | Minimal for factual tasks |
| Use stop sequences | Stops at exact boundary | None |
| Request structured output | Reduces verbosity | Improves consistency |
| Streaming | No cost savings (same tokens) | Improves perceived latency |

# Before: verbose
"Please write a detailed explanation of..."

# After: constrained
"Explain in 2-3 sentences:"

Cost Tracking

Log every API call to detect cost anomalies.

import time
from dataclasses import dataclass

@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int
    cost: float
    timestamp: float

class CostTracker:
    def __init__(self):
        self.records = []

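    def track(self, model, response):
        # Uses Anthropic-style usage fields. OpenAI responses expose
        # usage.prompt_tokens and usage.completion_tokens instead.
        usage = Usage(
            model=model,
            input_tokens=response.usage.input_tokens,
            output_tokens=response.usage.output_tokens,
            cost=self.calculate_cost(model, response.usage),
            timestamp=time.time()
        )

        # Compare against the 10 calls before this one, so the current
        # call does not inflate its own baseline
        previous = self.records[-10:]
        self.records.append(usage)

        # Alert on spike
        if len(previous) == 10:
            recent_avg = sum(r.cost for r in previous) / 10
            if usage.cost > recent_avg * 3:
                print(f"Cost spike: ${usage.cost:.4f} vs avg ${recent_avg:.4f}")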

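    def calculate_cost(self, model, usage):
        # USD per 1M tokens: (input, output), from the pricing table above
        rates = {
            "gpt-4o": (2.50, 10.00),
            "gpt-4o-mini": (0.15, 0.60),
            "claude-3.5-sonnet": (3.00, 15.00),
            "claude-haiku": (0.25, 1.25),
            "gemini-1.5-pro": (1.25, 5.00),
            "gemini-1.5-flash": (0.075, 0.30),
        }
        # Fail loudly rather than silently logging $0 for an unpriced model
        if model not in rates:
            raise ValueError(f"No pricing configured for model {model!r}")
        input_rate, output_rate = rates[model]
        input_cost = (usage.input_tokens / 1_000_000) * input_rate
        output_cost = (usage.output_tokens / 1_000_000) * output_rate
        return input_cost + output_cost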
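
Logged records become more useful once they are aggregated. As a rough sketch, the snippet below totals calls, tokens, and cost per model. `UsageRecord` is a stand-in that mirrors the fields of the `Usage` dataclass above so the example runs on its own, and `cost_by_model` is a hypothetical helper name.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Stand-in mirroring the Usage fields above (timestamp omitted)."""
    model: str
    input_tokens: int
    output_tokens: int
    cost: float


def cost_by_model(records):
    """Sum calls, tokens, and cost per model across all logged records."""
    totals = defaultdict(
        lambda: {"calls": 0, "input_tokens": 0, "output_tokens": 0, "cost": 0.0}
    )
    for r in records:
        t = totals[r.model]
        t["calls"] += 1
        t["input_tokens"] += r.input_tokens
        t["output_tokens"] += r.output_tokens
        t["cost"] += r.cost
    return dict(totals)


# Example records; costs follow the per-1M-token rates used in this article.
records = [
    UsageRecord("gpt-4o", 500, 100, 0.00225),
    UsageRecord("gpt-4o-mini", 500, 100, 0.000135),
    UsageRecord("gpt-4o", 1000, 200, 0.0045),
]

if __name__ == "__main__":
    # Most expensive model first
    ranked = sorted(cost_by_model(records).items(), key=lambda kv: -kv[1]["cost"])
    for model, t in ranked:
        print(f"{model}: {t['calls']} calls, ${t['cost']:.6f}")
```

Sorting by total cost puts the most expensive model first, which is usually where routing or caching changes pay off soonest.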

Real-World Savings

A typical chatbot handling 1,000 queries/day, average 500 input + 100 output tokens:

| Strategy | Daily Cost (GPT-4o) | Daily Cost (Optimized) | Savings |
|---|---|---|---|
| All queries to GPT-4o | $2.25 | | |
| Route 70% to GPT-4o-mini | | $0.95 | 58% |
| + Prompt caching | | $0.50 | 78% |
| + Constrained output tokens | | $0.38 | 83% |
| + Batch processing for eval | | $0.25 | 89% |

Each layer of optimization compounds. Start with model routing (biggest win), add caching, then constrain outputs.
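
The baseline figure in the table can be reproduced from the stated workload and the published rates. The sketch below does that arithmetic and also prices an idealized 70/30 routing split. `daily_cost` and the constants are illustrative names, and the split ignores any overhead such as the classification call used in two-tier routing.

```python
QUERIES_PER_DAY = 1_000
INPUT_TOKENS, OUTPUT_TOKENS = 500, 100

# USD per 1M tokens: (input, output), from the pricing table.
RATES = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}


def daily_cost(mix):
    """Daily USD cost for a traffic mix.

    mix maps model name -> fraction of daily queries; fractions should sum to 1.
    """
    total = 0.0
    for model, fraction in mix.items():
        in_rate, out_rate = RATES[model]
        per_query = (INPUT_TOKENS * in_rate + OUTPUT_TOKENS * out_rate) / 1_000_000
        total += QUERIES_PER_DAY * fraction * per_query
    return total


baseline = daily_cost({"gpt-4o": 1.0})
routed = daily_cost({"gpt-4o": 0.3, "gpt-4o-mini": 0.7})

if __name__ == "__main__":
    print(f"baseline ${baseline:.2f}, routed ${routed:.2f}, "
          f"savings {1 - routed / baseline:.0%}")
```

The baseline comes out at $2.25, matching the table. The idealized routed figure is about $0.77, lower than the table's $0.95. The gap presumably reflects costs this sketch leaves out, which the table does not itemize.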