The Cost Problem

Agents are expensive because they multiply LLM costs. A single-prompt chatbot costs 1 API call. An agent with a ReAct loop costs 5-15 API calls per task. Tool calls, handoffs, and error recovery add more. The cost is multiplicative — each additional step adds tokens for the growing message history.

The five levers below reduce costs at different points in the agent lifecycle. None is a silver bullet; most production agents use 2-3 in combination.

Lever 1: Model Tiering

Not every step in an agent loop needs a frontier model.

Classification ("is this about billing or technical?"): gpt-4o-mini ($0.15/1M)
Simple extraction ("extract the date from this email"):   gpt-4o-mini
Tool argument generation ("what args for search?"):       gpt-4o
Complex reasoning ("analyze these three contracts"):      claude-sonnet-4 ($3/1M)
Final synthesis ("write the executive summary"):          gpt-4o

Pattern: Route each step based on complexity.

def get_model_for_step(step_type: str, task_complexity: str):
    routing = {
        "classification":    "gpt-4o-mini",
        "extraction":        "gpt-4o-mini",
        "tool_args":         "gpt-4o",
        "reasoning":         "claude-sonnet-4" if task_complexity == "high" else "gpt-4o",
        "synthesis":         "gpt-4o",
        "safety_check":      "gpt-4o-mini",  # Safety checks are pattern matching
    }
    return routing.get(step_type, "gpt-4o")

Model Pricing Comparison (per 1M tokens)

gpt-4o-mini

Classification, extraction, routing. 10x cheaper than frontier.

Values: $0.15 input / $0.60 output

gpt-4o

General reasoning, tool use, synthesis. Good default for agent work.

Values: $2.50 input / $10 output

claude-haiku-4.5

Fast, cheap. Good for intermediate steps, safety checks.

Values: $1 input / $5 output

claude-sonnet-4.5

Complex reasoning, code generation. Strongest for agentic tasks.

Values: $3 input / $15 output

gemini-2.5-flash

Long context, multimodal. Competitive with gpt-4o-mini.

Values: $0.15 input / $0.60 output

Savings: 50-80% on per-task cost when simple steps use small models. Risk: router misclassification sends a complex task to a small model and fails.

Lever 2: Prompt Caching

Flat vector schema of prompt caching flow separating static prompts from new user input

Agent system prompts and tool definitions are identical across every turn. Caching them avoids re-processing the same tokens repeatedly.

Without caching:
  Turn 1: 5K tokens (system + tools + user message) → full price
  Turn 2: 5K tokens → full price again
  Turn 3: 5K tokens → full price again

With caching:
  Turn 1: 5K tokens → 4K cached (system+tools), 1K new → cache miss cost
  Turn 2: 5K tokens → 4K cache hit (90% discount), 1K new → ~90% cheaper
  Turn 3: 5K tokens → 4K cache hit, 1K new → ~90% cheaper

OpenAI caching: Automatic for prompts ≥1024 tokens. Cache hit = 50% discount on input tokens. Place static content first, variable content last. Extended retention (24h) on latest models. Monitor cached_tokens in usage response. No code changes needed — it's on by default.

Anthropic caching: Explicit cache_control breakpoints or automatic caching. Cache reads = 0.1x base input price (90% discount). Default 5-min TTL. 1-hour TTL at 2x write cost. Pre-warm cache with max_tokens: 0 to load system prompt before user requests arrive.

# Anthropic: explicit breakpoint on system prompt
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer...",  # 2K tokens of instructions
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
# Second call with same system prompt: 2K tokens at 0.1x price

# OpenAI: automatic — no code changes. Monitor cache hits.
response = client.responses.create(
    model="gpt-4o",
    input=[{"role": "system", "content": long_system_prompt},
           {"role": "user", "content": user_query}]
)
print(response.usage.input_tokens_details.cached_tokens)  # Monitor hits

Savings: 50-90% on input tokens after the first turn. Critical for agents with long system prompts and many few-shot examples. No downside other than the minimum token threshold.

Lever 3: Token Budgeting

Cap the maximum tokens per turn and per agent run. Prevents the agent from spiraling into verbose reasoning or infinite loops.

def run_agent_with_budget(agent, task, max_turns=8, max_tokens_per_turn=4096, total_budget=50000):
    total_tokens = 0

    for turn in range(max_turns):
        response = agent.run(
            task,
            max_tokens=min(max_tokens_per_turn, total_budget - total_tokens)
        )
        total_tokens += response.usage.total_tokens

        if total_tokens >= total_budget:
            return {"status": "budget_exceeded",
                    "partial_result": response.output,
                    "tokens_used": total_tokens}

        if response.done:
            return {"status": "completed",
                    "result": response.output,
                    "tokens_used": total_tokens,
                    "turns": turn + 1}

    return {"status": "max_turns_reached",
            "partial_result": response.output,
            "tokens_used": total_tokens}

Where to set limits:

Max turns: 5-10 for most agents. Increase for research agents, decrease for classification.
Max tokens per turn: Model-dependent. 4K for gpt-4o-mini, 8K for gpt-4o.
Total budget: ~50K tokens per task is generous for most agentic workflows.

Savings: 20-40% by preventing verbose spirals and limiting recovery attempts. Risk: truncating a critical completion mid-task.

Lever 4: Early Termination

Stop the agent loop when confidence is high or when additional steps show diminishing returns.

def should_terminate_early(response_history, confidence_threshold=0.9):
    """Stop if the agent is confident enough or if it's going in circles."""

    # Rule 1: Agent explicitly signals completion
    if "FINAL_ANSWER" in response_history[-1] or "task complete" in response_history[-1].lower():
        return True, "agent_signaled_completion"

    # Rule 2: Last 3 steps produced no new information
    if len(response_history) >= 3:
        last_three = response_history[-3:]
        if all(len(r) < 200 for r in last_three):
            return True, "diminishing_returns"

    # Rule 3: Agent is repeating itself (circular loop)
    if len(response_history) >= 2:
        if response_history[-1][:200] == response_history[-2][:200]:
            return True, "repeating_output"

    # Rule 4: LLM estimates high confidence (expensive check)
    if len(response_history) > 2:
        score = estimate_confidence(response_history)
        if score > confidence_threshold:
            return True, f"high_confidence_{score:.2f}"

    return False, None

Savings: 15-30% on per-task cost. Risk: terminating early on a task that needed 1-2 more steps to complete correctly. Best applied when false negatives are cheaper than unnecessary steps.

Lever 5: Compaction / Summarization

When conversation history exceeds the context window, summarize older messages instead of truncating them.

Without compaction:
  Turn 50: Context = 45K tokens (full history)
  Turn 100: Context window exceeded → oldest messages dropped

With compaction:
  Turn 50: Context = 15K (last 10 turns + summary of earlier 40)
  Turn 100: Context = 15K (last 10 turns + updated summary)

def compact_conversation(messages, keep_last=10):
    """Summarize older messages, keep recent ones in full."""
    if len(messages) <= keep_last:
        return messages

    to_summarize = messages[:-keep_last]
    recent = messages[-keep_last:]

    summary_prompt = f"Summarize this conversation in 3 bullet points:\n\n{to_summarize}"
    summary = llm.generate(summary_prompt, max_tokens=200)

    return [{"role": "system", "content": f"Earlier conversation summary: {summary}"}] + recent

Built-in options: OpenAI and Anthropic both offer automatic compaction. Letta has memory blocks that the agent updates itself. The tradeoff is information loss vs cost savings.

Savings: 30-60% on context tokens for long-running agents. Risk: summarizing away the one detail the next question needs.

The Full Stack: Combining Levers

Cyberpunk style optimized agent loop dashboard tracking turns and budgets

A production agent might use all five:

class OptimizedAgent:
    def __init__(self):
        self.router = ModelRouter()            # Lever 1: model tiering
        self.cache = PromptCache()             # Lever 2: prompt caching
        self.budget = TokenBudget(50000)       # Lever 3: token budgeting
        self.terminator = EarlyTerminator()    # Lever 4: early termination
        self.compactor = ContextCompactor()    # Lever 5: compaction

    def run(self, task):
        while not self.is_done():
            model = self.router.get_model(self.current_step_type())
            response = self.call_llm(model, self.build_prompt())
            self.budget.spend(response.usage.total_tokens)

            if self.terminator.should_stop(self.history):
                break
            if self.budget.exhausted():
                return self.partial_result()
            if self.context_too_long():
                self.compactor.compress()

        return self.final_result()

Tracking Costs

You can't optimize what you don't measure. Track these per task:

def track_agent_metrics(run):
    return {
        "task_id": run.task_id,
        "success": run.success,
        "turns": len(run.steps),
        "input_tokens": sum(s.usage.input_tokens for s in run.steps),
        "cached_tokens": sum(s.usage.cached_tokens for s in run.steps),
        "output_tokens": sum(s.usage.output_tokens for s in run.steps),
        "total_cost": sum(s.usage.cost for s in run.steps),
        "model_breakdown": Counter(s.model for s in run.steps),
        "tool_calls": sum(len(s.tool_calls) for s in run.steps),
        "premature_termination": run.terminated_early,
        "compaction_events": run.compaction_count
    }

Cost Reduction Summary

Model Tiering

Route simple steps to cheap models. Classification, extraction, safety checks don't need frontier models.

Values: 50-80% reduction

Prompt Caching

Cache static prefixes (system prompt, tools). Free on OpenAI, explicit on Anthropic. Largest single savings lever.

Values: 50-90% on input tokens

Token Budgeting

Cap turns and tokens. Prevents infinite loops and verbose spirals. Set per-agent limits based on task complexity.

Values: 20-40% reduction

Early Termination

Stop when done, not when max turns hit. Confidence checks and diminishing-return detection.

Values: 15-30% reduction

Compaction

Summarize old context. Essential for long-running agents. Information loss is the main risk.

Values: 30-60% on context tokens

Note:

Don't optimize prematurely. An agent that costs $0.10 per task and runs 100 tasks a day costs $10/day. An agent that costs $5 per task and runs 100 tasks a day costs $500/day. Optimize the second agent. The first one is fine. Measure first, optimize second.

Key Takeaway

Prompt caching is the single highest-leverage cost optimization for agents — it's free (or nearly free) to implement and saves 50-90% on input tokens after the first turn. Implement it first. Model tiering is second: your agent's classification step doesn't need a frontier model. Token budgeting and early termination prevent runaway costs. Compaction keeps long-running agents sustainable. Track cost per task, not just total spend — it tells you whether optimizations are actually working.

OpenAI Agents SDK: Architecture Deep-Dive and Framework Comparison

Detailed technical analysis of OpenAI's new Agents SDK — architecture, tool-use patterns, multi-agent orchestration, guardrails, tracing, and how it compares to LangGraph, AutoGen, and CrewAI across dimensions that matter for production deployments.

File-Based ETL Agent Blueprint

AI agent that builds ETL pipelines from CSV and JSON files: infers schemas, validates data, transforms and cleans, validates output, and generates documentation. Self-contained, no external DB needed.

Sandboxed Code Execution for AI Agents with MicroPython + WASM

Step-by-step tutorial on building a safe code-execution tool for AI agents using MicroPython compiled to WebAssembly. Covers installation, one-shot and persistent sessions, resource limits, host functions, and integration into agent tool loops — with working code you can copy and run.

Agent Cost Optimization