Agent Cost Optimization

Five levers for reducing agent API costs — model tiering, prompt caching, token budgeting, early termination, and compaction. Production patterns with real savings estimates.

June 12, 2026
cost-optimizationtoken-budgetingprompt-cachingmodel-tieringagent-costscompaction

The Cost Problem

Agents are expensive because they multiply LLM costs. A single-prompt chatbot costs 1 API call. An agent with a ReAct loop costs 5-15 API calls per task. Tool calls, handoffs, and error recovery add more. The cost is multiplicative — each additional step adds tokens for the growing message history.

The five levers below reduce costs at different points in the agent lifecycle. None is a silver bullet; most production agents use 2-3 in combination.

Lever 1: Model Tiering

Not every step in an agent loop needs a frontier model.

Classification ("is this about billing or technical?"): gpt-4o-mini ($0.15/1M)
Simple extraction ("extract the date from this email"):   gpt-4o-mini
Tool argument generation ("what args for search?"):       gpt-4o
Complex reasoning ("analyze these three contracts"):      claude-sonnet-4 ($3/1M)
Final synthesis ("write the executive summary"):          gpt-4o

Pattern: Route each step based on complexity.

def get_model_for_step(step_type: str, task_complexity: str):
    routing = {
        "classification":    "gpt-4o-mini",
        "extraction":        "gpt-4o-mini",
        "tool_args":         "gpt-4o",
        "reasoning":         "claude-sonnet-4" if task_complexity == "high" else "gpt-4o",
        "synthesis":         "gpt-4o",
        "safety_check":      "gpt-4o-mini",  # Safety checks are pattern matching
    }
    return routing.get(step_type, "gpt-4o")

Model Pricing Comparison (per 1M tokens)

gpt-4o-mini
Classification, extraction, routing. 10x cheaper than frontier.

Values: $0.15 input / $0.60 output

gpt-4o
General reasoning, tool use, synthesis. Good default for agent work.

Values: $2.50 input / $10 output

claude-haiku-4.5
Fast, cheap. Good for intermediate steps, safety checks.

Values: $1 input / $5 output

claude-sonnet-4.5
Complex reasoning, code generation. Strongest for agentic tasks.

Values: $3 input / $15 output

gemini-2.5-flash
Long context, multimodal. Competitive with gpt-4o-mini.

Values: $0.15 input / $0.60 output

Savings: 50-80% on per-task cost when simple steps use small models. Risk: router misclassification sends a complex task to a small model and fails.

Lever 2: Prompt Caching

Agent system prompts and tool definitions are identical across every turn. Caching them avoids re-processing the same tokens repeatedly.

Without caching:
  Turn 1: 5K tokens (system + tools + user message) → full price
  Turn 2: 5K tokens → full price again
  Turn 3: 5K tokens → full price again

With caching:
  Turn 1: 5K tokens → 4K cached (system+tools), 1K new → cache miss cost
  Turn 2: 5K tokens → 4K cache hit (90% discount), 1K new → ~90% cheaper
  Turn 3: 5K tokens → 4K cache hit, 1K new → ~90% cheaper

OpenAI caching: Automatic for prompts ≥1024 tokens. Cache hit = 50% discount on input tokens. Place static content first, variable content last. Extended retention (24h) on latest models. Monitor cached_tokens in usage response. No code changes needed — it's on by default.

Anthropic caching: Explicit cache_control breakpoints or automatic caching. Cache reads = 0.1x base input price (90% discount). Default 5-min TTL. 1-hour TTL at 2x write cost. Pre-warm cache with max_tokens: 0 to load system prompt before user requests arrive.

# Anthropic: explicit breakpoint on system prompt
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer...",  # 2K tokens of instructions
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
# Second call with same system prompt: 2K tokens at 0.1x price
# OpenAI: automatic — no code changes. Monitor cache hits.
response = client.responses.create(
    model="gpt-4o",
    input=[{"role": "system", "content": long_system_prompt},
           {"role": "user", "content": user_query}]
)
print(response.usage.input_tokens_details.cached_tokens)  # Monitor hits

Savings: 50-90% on input tokens after the first turn. Critical for agents with long system prompts and many few-shot examples. No downside other than the minimum token threshold.

Lever 3: Token Budgeting

Cap the maximum tokens per turn and per agent run. Prevents the agent from spiraling into verbose reasoning or infinite loops.

def run_agent_with_budget(agent, task, max_turns=8, max_tokens_per_turn=4096, total_budget=50000):
    total_tokens = 0

    for turn in range(max_turns):
        response = agent.run(
            task,
            max_tokens=min(max_tokens_per_turn, total_budget - total_tokens)
        )
        total_tokens += response.usage.total_tokens

        if total_tokens >= total_budget:
            return {"status": "budget_exceeded",
                    "partial_result": response.output,
                    "tokens_used": total_tokens}

        if response.done:
            return {"status": "completed",
                    "result": response.output,
                    "tokens_used": total_tokens,
                    "turns": turn + 1}

    return {"status": "max_turns_reached",
            "partial_result": response.output,
            "tokens_used": total_tokens}

Where to set limits:

  • Max turns: 5-10 for most agents. Increase for research agents, decrease for classification.
  • Max tokens per turn: Model-dependent. 4K for gpt-4o-mini, 8K for gpt-4o.
  • Total budget: ~50K tokens per task is generous for most agentic workflows.

Savings: 20-40% by preventing verbose spirals and limiting recovery attempts. Risk: truncating a critical completion mid-task.

Lever 4: Early Termination

Stop the agent loop when confidence is high or when additional steps show diminishing returns.

def should_terminate_early(response_history, confidence_threshold=0.9):
    """Stop if the agent is confident enough or if it's going in circles."""

    # Rule 1: Agent explicitly signals completion
    if "FINAL_ANSWER" in response_history[-1] or "task complete" in response_history[-1].lower():
        return True, "agent_signaled_completion"

    # Rule 2: Last 3 steps produced no new information
    if len(response_history) >= 3:
        last_three = response_history[-3:]
        if all(len(r) < 200 for r in last_three):
            return True, "diminishing_returns"

    # Rule 3: Agent is repeating itself (circular loop)
    if len(response_history) >= 2:
        if response_history[-1][:200] == response_history[-2][:200]:
            return True, "repeating_output"

    # Rule 4: LLM estimates high confidence (expensive check)
    if len(response_history) > 2:
        score = estimate_confidence(response_history)
        if score > confidence_threshold:
            return True, f"high_confidence_{score:.2f}"

    return False, None

Savings: 15-30% on per-task cost. Risk: terminating early on a task that needed 1-2 more steps to complete correctly. Best applied when false negatives are cheaper than unnecessary steps.

Lever 5: Compaction / Summarization

When conversation history exceeds the context window, summarize older messages instead of truncating them.

Without compaction:
  Turn 50: Context = 45K tokens (full history)
  Turn 100: Context window exceeded → oldest messages dropped

With compaction:
  Turn 50: Context = 15K (last 10 turns + summary of earlier 40)
  Turn 100: Context = 15K (last 10 turns + updated summary)
def compact_conversation(messages, keep_last=10):
    """Summarize older messages, keep recent ones in full."""
    if len(messages) <= keep_last:
        return messages

    to_summarize = messages[:-keep_last]
    recent = messages[-keep_last:]

    summary_prompt = f"Summarize this conversation in 3 bullet points:\n\n{to_summarize}"
    summary = llm.generate(summary_prompt, max_tokens=200)

    return [{"role": "system", "content": f"Earlier conversation summary: {summary}"}] + recent

Built-in options: OpenAI and Anthropic both offer automatic compaction. Letta has memory blocks that the agent updates itself. The tradeoff is information loss vs cost savings.

Savings: 30-60% on context tokens for long-running agents. Risk: summarizing away the one detail the next question needs.

The Full Stack: Combining Levers

A production agent might use all five:

class OptimizedAgent:
    def __init__(self):
        self.router = ModelRouter()            # Lever 1: model tiering
        self.cache = PromptCache()             # Lever 2: prompt caching
        self.budget = TokenBudget(50000)       # Lever 3: token budgeting
        self.terminator = EarlyTerminator()    # Lever 4: early termination
        self.compactor = ContextCompactor()    # Lever 5: compaction

    def run(self, task):
        while not self.is_done():
            model = self.router.get_model(self.current_step_type())
            response = self.call_llm(model, self.build_prompt())
            self.budget.spend(response.usage.total_tokens)

            if self.terminator.should_stop(self.history):
                break
            if self.budget.exhausted():
                return self.partial_result()
            if self.context_too_long():
                self.compactor.compress()

        return self.final_result()

Tracking Costs

You can't optimize what you don't measure. Track these per task:

def track_agent_metrics(run):
    return {
        "task_id": run.task_id,
        "success": run.success,
        "turns": len(run.steps),
        "input_tokens": sum(s.usage.input_tokens for s in run.steps),
        "cached_tokens": sum(s.usage.cached_tokens for s in run.steps),
        "output_tokens": sum(s.usage.output_tokens for s in run.steps),
        "total_cost": sum(s.usage.cost for s in run.steps),
        "model_breakdown": Counter(s.model for s in run.steps),
        "tool_calls": sum(len(s.tool_calls) for s in run.steps),
        "premature_termination": run.terminated_early,
        "compaction_events": run.compaction_count
    }

Cost Reduction Summary

Model Tiering
Route simple steps to cheap models. Classification, extraction, safety checks don't need frontier models.

Values: 50-80% reduction

Prompt Caching
Cache static prefixes (system prompt, tools). Free on OpenAI, explicit on Anthropic. Largest single savings lever.

Values: 50-90% on input tokens

Token Budgeting
Cap turns and tokens. Prevents infinite loops and verbose spirals. Set per-agent limits based on task complexity.

Values: 20-40% reduction

Early Termination
Stop when done, not when max turns hit. Confidence checks and diminishing-return detection.

Values: 15-30% reduction

Compaction
Summarize old context. Essential for long-running agents. Information loss is the main risk.

Values: 30-60% on context tokens

Note:

Don't optimize prematurely. An agent that costs $0.10 per task and runs 100 tasks a day costs $10/day. An agent that costs $5 per task and runs 100 tasks a day costs $500/day. Optimize the second agent. The first one is fine. Measure first, optimize second.

Key Takeaway

Prompt caching is the single highest-leverage cost optimization for agents — it's free (or nearly free) to implement and saves 50-90% on input tokens after the first turn. Implement it first. Model tiering is second: your agent's classification step doesn't need a frontier model. Token budgeting and early termination prevent runaway costs. Compaction keeps long-running agents sustainable. Track cost per task, not just total spend — it tells you whether optimizations are actually working.