Agent Cost Optimization
Five levers for reducing agent API costs — model tiering, prompt caching, token budgeting, early termination, and compaction. Production patterns with real savings estimates.
The Cost Problem
Agents are expensive because they multiply LLM calls. A single-prompt chatbot makes 1 API call per request. An agent with a ReAct loop makes 5-15 API calls per task, and tool calls, handoffs, and error recovery add more. Cost grows faster than the call count, because each additional step re-sends the growing message history as input tokens.
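To make the history effect concrete, here is a rough sketch of cumulative input tokens for a loop that re-sends everything on each call. The base prompt size and per-step growth are illustrative assumptions, not measurements.

```python
def cumulative_input_tokens(steps: int, base_prompt: int = 2000,
                            growth_per_step: int = 800) -> int:
    """Total input tokens billed across `steps` calls when every call
    re-sends the base prompt plus all history accumulated so far."""
    total = 0
    for step in range(steps):
        # Call number `step` carries `step` earlier exchanges in its history.
        total += base_prompt + step * growth_per_step
    return total


single_call = cumulative_input_tokens(1)
ten_step_agent = cumulative_input_tokens(10)
print(single_call, ten_step_agent, ten_step_agent / single_call)
```

Under these assumed numbers, ten steps bill 28 times the input tokens of a single call rather than 10 times, which is why the levers that shrink or discount the repeated prefix matter so much.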
The five levers below reduce costs at different points in the agent lifecycle. None is a silver bullet; most production agents use 2-3 in combination.
Lever 1: Model Tiering
Not every step in an agent loop needs a frontier model.
Classification ("is this about billing or technical?"): gpt-4o-mini ($0.15/1M)
Simple extraction ("extract the date from this email"): gpt-4o-mini
Tool argument generation ("what args for search?"): gpt-4o
Complex reasoning ("analyze these three contracts"): claude-sonnet-4 ($3/1M)
Final synthesis ("write the executive summary"): gpt-4o
Pattern: Route each step based on complexity.
def get_model_for_step(step_type: str, task_complexity: str):
    routing = {
        "classification": "gpt-4o-mini",
        "extraction": "gpt-4o-mini",
        "tool_args": "gpt-4o",
        "reasoning": "claude-sonnet-4" if task_complexity == "high" else "gpt-4o",
        "synthesis": "gpt-4o",
        "safety_check": "gpt-4o-mini",  # Safety checks are pattern matching
    }
    # Unknown step types fall back to the mid-tier model
    return routing.get(step_type, "gpt-4o")
Model Pricing Comparison (per 1M tokens)
gpt-4o-mini: $0.15 input / $0.60 output
gpt-4o: $2.50 input / $10 output
Values: $1 input / $5 output
claude-sonnet-4: $3 input / $15 output
Values: $0.15 input / $0.60 output
Savings: 50-80% on per-task cost when simple steps use small models. Risk: router misclassification sends a complex task to a small model and fails.
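A quick way to sanity-check the saving for your own workload is to price each step under both strategies. The sketch below uses the gpt-4o-mini and gpt-4o prices quoted in this article; the step list and token counts are hypothetical.

```python
# Prices per 1M tokens as (input, output), from the figures quoted in this article.
PRICES = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
}

# Hypothetical task: (step_type, input_tokens, output_tokens)
STEPS = [
    ("classification", 3000, 20),
    ("extraction", 3500, 100),
    ("tool_args", 4000, 150),
    ("synthesis", 6000, 800),
]
SMALL_MODEL_STEPS = {"classification", "extraction"}


def step_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the listed per-1M-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def task_cost(steps, tiered: bool) -> float:
    """Price a whole task, routing simple steps to the small model if tiered."""
    total = 0.0
    for step_type, input_tokens, output_tokens in steps:
        use_small = tiered and step_type in SMALL_MODEL_STEPS
        model = "gpt-4o-mini" if use_small else "gpt-4o"
        total += step_cost(model, input_tokens, output_tokens)
    return total


frontier_only = task_cost(STEPS, tiered=False)
tiered = task_cost(STEPS, tiered=True)
print(f"frontier only: ${frontier_only:.4f}  tiered: ${tiered:.4f}  "
      f"saved: {1 - tiered / frontier_only:.0%}")
```

The saving scales with the share of tokens that the simple steps consume, so a task dominated by synthesis or reasoning will see less benefit than one dominated by classification and extraction.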
Lever 2: Prompt Caching
Agent system prompts and tool definitions are identical across every turn. Caching them avoids re-processing the same tokens repeatedly.
Without caching:
Turn 1: 5K tokens (system + tools + user message) → full price
Turn 2: 5K tokens → full price again
Turn 3: 5K tokens → full price again
With caching:
Turn 1: 5K tokens → 4K written to cache (system + tools), 1K new → cache miss, full price or a small write premium
Turn 2: 5K tokens → 4K cache hit (90% discount), 1K new → ~72% cheaper
Turn 3: 5K tokens → 4K cache hit, 1K new → ~72% cheaper
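The per-turn saving is smaller than the cache discount itself, because the discount applies only to the cached 4K prefix while the 1K of new tokens still bills at full price. A minimal sketch of that arithmetic, using the token counts from the example above:

```python
def effective_input_tokens(cached: int, new: int, cached_multiplier: float) -> float:
    """Input billed for one turn, expressed in full-price token equivalents."""
    return cached * cached_multiplier + new


full_price = effective_input_tokens(4000, 1000, cached_multiplier=1.0)  # no caching
hit_90 = effective_input_tokens(4000, 1000, cached_multiplier=0.1)      # 90% discount
hit_50 = effective_input_tokens(4000, 1000, cached_multiplier=0.5)      # 50% discount

for label, billed in [("90% discount", hit_90), ("50% discount", hit_50)]:
    saving = 1 - billed / full_price
    print(f"{label}: {billed:.0f} token equivalents, {saving:.0%} cheaper")
```

The larger the static prefix relative to the new content, the closer the turn-level saving gets to the headline discount.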
OpenAI caching: Automatic for prompts of 1,024 tokens or more, on by default with no code changes. A cache hit gives a 50% discount on the cached input tokens for gpt-4o; the discount varies by model. Place static content first and variable content last, because the cache matches on the prompt prefix. Extended retention (24h) is available on the latest models. Monitor cached_tokens in the usage response.
Anthropic caching: Explicit cache_control breakpoints or automatic caching. Cache reads cost 0.1x the base input price (90% discount). The default TTL is 5 minutes, with cache writes billed at 1.25x the base input price; a 1-hour TTL is available at 2x. Pre-warm cache with max_tokens: 0 to load system prompt before user requests arrive.
# Anthropic: explicit breakpoint on system prompt
import anthropic

anthropic_client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = anthropic_client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an expert code reviewer...",  # 2K tokens of instructions
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[{"role": "user", "content": user_query}]
)
# Second call with same system prompt: 2K tokens at 0.1x price

# OpenAI: automatic, no code changes. Monitor cache hits.
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = openai_client.responses.create(
    model="gpt-4o",
    input=[{"role": "system", "content": long_system_prompt},
           {"role": "user", "content": user_query}]
)
print(response.usage.input_tokens_details.cached_tokens)  # Monitor hits
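Savings: 50-90% on the cached portion of input tokens after the first turn. Critical for agents with long system prompts and many few-shot examples. The downsides are small: prompts below the minimum token threshold are not cached, and on Anthropic a cache write costs more than a normal input token, so a prefix that is never reused before the TTL expires costs slightly more than not caching.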
Lever 3: Token Budgeting
Cap the maximum tokens per turn and per agent run. Prevents the agent from spiraling into verbose reasoning or infinite loops.
def run_agent_with_budget(agent, task, max_turns=8, max_tokens_per_turn=4096, total_budget=50000):
    total_tokens = 0
    response = None  # stays None only if max_turns is 0
    for turn in range(max_turns):
        # Never request more output than the remaining budget allows
        response = agent.run(
            task,
            max_tokens=min(max_tokens_per_turn, total_budget - total_tokens)
        )
        total_tokens += response.usage.total_tokens
        # Check completion first so a task that finishes on its last
        # affordable turn is not reported as a budget failure
        if response.done:
            return {"status": "completed",
                    "result": response.output,
                    "tokens_used": total_tokens,
                    "turns": turn + 1}
        if total_tokens >= total_budget:
            return {"status": "budget_exceeded",
                    "partial_result": response.output,
                    "tokens_used": total_tokens}
    return {"status": "max_turns_reached",
            "partial_result": response.output if response else None,
            "tokens_used": total_tokens}
Where to set limits:
- Max turns: 5-10 for most agents. Increase for research agents, decrease for classification.
- Max tokens per turn: Model-dependent. 4K for gpt-4o-mini, 8K for gpt-4o.
- Total budget: ~50K tokens per task is generous for most agentic workflows.
Savings: 20-40% by preventing verbose spirals and limiting recovery attempts. Risk: truncating a critical completion mid-task.
Lever 4: Early Termination
Stop the agent loop when confidence is high or when additional steps show diminishing returns.
def should_terminate_early(response_history, confidence_threshold=0.9):
    """Stop if the agent is confident enough or if it's going in circles."""
    if not response_history:
        return False, None
    # Rule 1: Agent explicitly signals completion
    if "FINAL_ANSWER" in response_history[-1] or "task complete" in response_history[-1].lower():
        return True, "agent_signaled_completion"
    # Rule 2: Last 3 steps were all very short, used here as a cheap proxy
    # for "produced no new information"
    if len(response_history) >= 3:
        last_three = response_history[-3:]
        if all(len(r) < 200 for r in last_three):
            return True, "diminishing_returns"
    # Rule 3: Agent is repeating itself (circular loop)
    if len(response_history) >= 2:
        if response_history[-1][:200] == response_history[-2][:200]:
            return True, "repeating_output"
    # Rule 4: LLM estimates high confidence (expensive check).
    # estimate_confidence is a placeholder for your own scoring call.
    if len(response_history) > 2:
        score = estimate_confidence(response_history)
        if score > confidence_threshold:
            return True, f"high_confidence_{score:.2f}"
    return False, None
Savings: 15-30% on per-task cost. Risk: terminating early on a task that needed 1-2 more steps to complete correctly. Best applied when an occasional premature stop costs less than the unnecessary steps it avoids.
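An exact comparison of the first 200 characters misses loops where the agent repeats itself with small variations, such as a changed ID or timestamp. One possible refinement, sketched here with the standard library's difflib, is to compare consecutive outputs by similarity ratio. The function name, the 0.9 threshold, and the 500-character cap are assumptions to tune for your agent.

```python
from difflib import SequenceMatcher


def is_near_repeat(current: str, previous: str,
                   threshold: float = 0.9, max_chars: int = 500) -> bool:
    """True when two consecutive outputs are almost the same text."""
    if not current or not previous:
        return False
    # Cap the compared length: SequenceMatcher is slow on long strings.
    a, b = current[:max_chars], previous[:max_chars]
    # autojunk=False avoids the heuristic that ignores frequent characters
    # in sequences of 200+ items, which would distort the ratio here.
    return SequenceMatcher(None, a, b, autojunk=False).ratio() >= threshold


first = "Searching the billing database for invoice 1042. No results found."
second = "Searching the billing database for invoice 1043. No results found."
third = "Found the invoice. The customer was charged twice on March 3."

print(is_near_repeat(second, first))
print(is_near_repeat(third, second))
```

The first pair differs by a single character and is flagged as a repeat; the second pair is not. A check like this could replace the equality test in Rule 3 at negligible cost, since it needs no extra LLM call.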
Lever 5: Compaction / Summarization
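When conversation history grows long, summarize older messages instead of re-sending them in full or truncating them. This cuts the input tokens billed on every later turn and avoids silently dropping the oldest messages once the context window fills.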
Without compaction:
Turn 50: Context = 45K tokens (full history)
Turn 100: Context window exceeded → oldest messages dropped
With compaction:
Turn 50: Context = 15K (last 10 turns + summary of earlier 40)
Turn 100: Context = 15K (last 10 turns + updated summary)
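def compact_conversation(messages, keep_last=10):
    """Summarize older messages, keep recent ones in full."""
    if len(messages) <= keep_last:
        return messages
    to_summarize = messages[:-keep_last]
    recent = messages[-keep_last:]
    # Render the messages as plain text; interpolating the list directly
    # would send Python dict syntax to the model.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in to_summarize)
    summary_prompt = f"Summarize this conversation in 3 bullet points:\n\n{transcript}"
    # llm is a placeholder for your own client wrapper
    summary = llm.generate(summary_prompt, max_tokens=200)
    return [{"role": "system", "content": f"Earlier conversation summary: {summary}"}] + recent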
Built-in options: OpenAI and Anthropic both offer automatic compaction. Letta has memory blocks that the agent updates itself. The tradeoff is information loss vs cost savings.
Savings: 30-60% on context tokens for long-running agents. Risk: summarizing away the one detail the next question needs.
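Compaction itself costs an LLM call, so it is worth triggering only when the history is actually over budget. Below is a minimal sketch of such a trigger. The helper names are hypothetical, the 15K limit mirrors the example above, and the four-characters-per-token estimate is a rough rule of thumb for English text; a real tokenizer would be more accurate.

```python
def estimate_tokens(messages: list[dict]) -> int:
    """Rough token estimate, assuming about 4 characters per token."""
    return sum(len(m["content"]) for m in messages) // 4


def needs_compaction(messages: list[dict], token_limit: int = 15_000,
                     keep_last: int = 10) -> bool:
    """Compact only when there are older messages to summarize
    and the estimated history size exceeds the limit."""
    return len(messages) > keep_last and estimate_tokens(messages) > token_limit


short_history = [{"role": "user", "content": "hi"}] * 5
long_history = [{"role": "user", "content": "x" * 4000}] * 20

print(needs_compaction(short_history))
print(needs_compaction(long_history))
```

Checking this before each call keeps short conversations untouched and limits the information-loss risk to runs that are long enough to need it.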
The Full Stack: Combining Levers
A production agent might use all five:
class OptimizedAgent:
    def __init__(self):
        self.router = ModelRouter()          # Lever 1: model tiering
        self.cache = PromptCache()           # Lever 2: prompt caching
        self.budget = TokenBudget(50000)     # Lever 3: token budgeting
        self.terminator = EarlyTerminator()  # Lever 4: early termination
        self.compactor = ContextCompactor()  # Lever 5: compaction

    def run(self, task):
        while not self.is_done():
            model = self.router.get_model(self.current_step_type())
            response = self.call_llm(model, self.build_prompt())
            self.budget.spend(response.usage.total_tokens)
            if self.terminator.should_stop(self.history):
                break
            if self.budget.exhausted():
                return self.partial_result()
            if self.context_too_long():
                self.compactor.compress()
        return self.final_result()
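The helper classes in OptimizedAgent are not defined in this article. As one illustration, a TokenBudget with the spend() and exhausted() methods used above could look like the sketch below; the remaining() method is an added convenience, not something the class above calls.

```python
class TokenBudget:
    """Tracks token spend against a fixed per-task limit."""

    def __init__(self, limit: int):
        self.limit = limit
        self.spent = 0

    def spend(self, tokens: int) -> None:
        """Record tokens consumed by one LLM call."""
        self.spent += tokens

    def remaining(self) -> int:
        """Tokens left before the limit, never negative."""
        return max(self.limit - self.spent, 0)

    def exhausted(self) -> bool:
        """True once spend has reached or passed the limit."""
        return self.spent >= self.limit


budget = TokenBudget(50_000)
budget.spend(30_000)
print(budget.remaining(), budget.exhausted())
budget.spend(25_000)
print(budget.remaining(), budget.exhausted())
```

Keeping the budget as its own small object makes it easy to share one limit across sub-agents or handoffs that belong to the same task.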
Tracking Costs
You can't optimize what you don't measure. Track these per task:
from collections import Counter

def track_agent_metrics(run):
    return {
        "task_id": run.task_id,
        "success": run.success,
        "turns": len(run.steps),
        "input_tokens": sum(s.usage.input_tokens for s in run.steps),
        "cached_tokens": sum(s.usage.cached_tokens for s in run.steps),
        "output_tokens": sum(s.usage.output_tokens for s in run.steps),
        "total_cost": sum(s.usage.cost for s in run.steps),
        "model_breakdown": Counter(s.model for s in run.steps),  # calls per model
        "tool_calls": sum(len(s.tool_calls) for s in run.steps),
        "premature_termination": run.terminated_early,
        "compaction_events": run.compaction_count
    }
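Per-run metrics become more useful once aggregated, because failed runs still cost money. One possible aggregate, sketched below, divides total spend by the number of successful tasks. It assumes each record is a dict with the "success" and "total_cost" keys produced above; the sample values are made up.

```python
from typing import Optional


def cost_per_successful_task(runs: list[dict]) -> Optional[float]:
    """Total spend across all runs divided by the number that succeeded.
    Returns None when nothing succeeded, to avoid dividing by zero."""
    successes = sum(1 for r in runs if r["success"])
    if successes == 0:
        return None
    return sum(r["total_cost"] for r in runs) / successes


sample_runs = [
    {"success": True, "total_cost": 0.25},
    {"success": True, "total_cost": 0.25},
    {"success": False, "total_cost": 0.50},  # failed run still adds to spend
]
print(cost_per_successful_task(sample_runs))
```

If an optimization lowers average cost per run but this number goes up, the saving is being paid for with failures.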
Cost Reduction Summary
Model tiering: 50-80% reduction
Prompt caching: 50-90% on input tokens
Token budgeting: 20-40% reduction
Early termination: 15-30% reduction
Compaction: 30-60% on context tokens
Note:
Don't optimize prematurely. An agent that costs $0.10 per task and runs 100 tasks a day costs $10/day. An agent that costs $5 per task and runs 100 tasks a day costs $500/day. Optimize the second agent. The first one is fine. Measure first, optimize second.
Key Takeaway
Prompt caching is the single highest-leverage cost optimization for agents — it's free (or nearly free) to implement and saves 50-90% on input tokens after the first turn. Implement it first. Model tiering is second: your agent's classification step doesn't need a frontier model. Token budgeting and early termination prevent runaway costs. Compaction keeps long-running agents sustainable. Track cost per task, not just total spend — it tells you whether optimizations are actually working.