Back to blog

Tuesday, June 16, 2026

OpenAI Agents SDK: Architecture Deep-Dive and Framework Comparison

cover

What Is the OpenAI Agents SDK?

The OpenAI Agents SDK is an orchestration framework for designing, building, and deploying LLM-powered agents. Released as a lightweight Python package (pip install openai-agents), it's a production-grade evolution of OpenAI's earlier Swarm experiment — same minimal API surface, but built for real-world reliability.

The SDK occupies a specific spot in the stack. It's not a model API (that's the Responses API and Chat Completions API). It's not a platform tool (that's the now-deprecated Agent Builder). It's an agent runtime — a library that manages the agent loop, dispatches tools, executes handoffs, runs guardrails, and collects traces.

Your Application
    ↓ SDK (Agent + Runner)
    ↓ Models (Responses API / Chat Completions / LiteLLM)
    ↓ Tools (function_tool / MCP / Hosted Tools)

This matters because OpenAI is deprecating the Assistants API (sunset August 26, 2026). The SDK is the recommended path forward for anyone who was building on Assistants and needs a framework to manage multi-turn, multi-tool, multi-agent workflows.

What It Isn't

  • Not a replacement for direct API calls. If you need one LLM call with one tool, call the Responses API directly. The SDK adds 0 value and 1 dependency.
  • Not a graph engine. LangGraph this is not. The SDK doesn't model workflows as directed graphs with state nodes and edges. It models agents as autonomous tools that can call other agents.
  • Not a platform. You host it. OpenAI doesn't run your agent loop.

Architecture: Four Primitives

The SDK is built on a deliberately small set of abstractions. The entire framework fits in your head after reading one page of docs.

1. Agent

An Agent is an LLM configured with instructions, tools, and optional runtime behavior:

from agents import Agent

support_agent = Agent(
    name="Support Agent",
    instructions="You handle general support inquiries.",
    tools=[search_knowledge_base, escalate_to_human],
    model="gpt-4o-mini",  # per-agent model override
)

The agent owns:

  • Instructions — system prompt. Can use prompt injection for dynamic behavior.
  • Tools — function tools, hosted tools (WebSearch, FileSearch, Computer), MCP servers.
  • Handoffs — references to other Agent objects. The LLM can choose to delegate.
  • Guardrails — input and output validation functions that run alongside execution.
  • Output type — optional Pydantic model for structured output.
  • Model override — each agent can use a different model.

The key insight: an Agent is a lightweight descriptor, not a long-lived process. It defines LLM instructions and capabilities. The Runner brings it to life.

2. Runner

The Runner is the execution engine. It manages the agent loop:

  1. Call the LLM with the agent's instructions and conversation history
  2. If the LLM returns tool calls, execute them
  3. Feed tool results back to the LLM
  4. If the LLM returns a handoff, transfer control to the target agent
  5. Repeat until done or max_turns reached
from agents import Agent, Runner

result = Runner.run_sync(
    support_agent,
    "I was charged twice for my subscription.",
)
print(result.final_output)

Runner.run() is the async version. Both return a RunResult that includes final_output, the full conversation history, and trace metadata.

The runner is where the SDK earns its keep. Without it, you'd write this loop yourself — calling the API, parsing tool calls, executing functions, checking for handoffs, re-entering the conversation. With it, you define the agents and the SDK manages state transitions.

3. Tools

Tools come in three flavors:

Function Tools — any Python function decorated with @function_tool:

from agents import function_tool

@function_tool
def get_weather(city: str) -> str:
    """Get the weather for a given city."""
    return f"The weather in {city} is sunny."

The SDK uses Python's inspect module for signature extraction, griffe for docstring parsing, and pydantic for schema generation. The OpenAI function-calling schema is generated automatically.

The @function_tool(defer_loading=True) option hides a function tool until a ToolSearchTool() runtime helper loads it — useful for agent toolkits where you want to register dozens of tools but only make the relevant ones visible.

Hosted Tools — OpenAI-managed tools that require no code:

  • WebSearchTool() — browse the web
  • FileSearchTool() — search uploaded files
  • ComputerTool() — computer-use actions (beta)
  • CodeInterpreterTool() — execute code in a sandbox

MCP Tools — any Model Context Protocol server:

from agents.mcp import MCPServerStdio

async with MCPServerStdio(
    name="Filesystem Server",
    params={"command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/tmp"]}
) as server:
    agent = Agent(
        name="File Agent",
        instructions="Manage files for the user.",
        mcp_servers=[server],
    )

4. Handoffs

Handoffs are the SDK's multi-agent mechanism. An agent delegates to another by including the target agent in its handoffs list. The LLM decides when to hand off — the SDK recognizes the handoff signal in the LLM's response and transfers control automatically.

billing_agent = Agent(
    name="Billing Agent",
    instructions="Handle billing inquiries: charges, invoices, payment methods, refunds.",
)

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route users to the right agent. For billing, hand off to Billing Agent.",
    handoffs=[billing_agent],
)

result = Runner.run_sync(triage_agent, "I was charged twice.")

Under the hood, handoffs work by rewriting the conversation to simulate a clean handoff context. The incoming agent sees only the relevant conversation history — not every previous turn from every agent. The SDK provides built-in handoff prompt templates (agents.extensions.handoff_prompt.prompt_with_handoff_instructions) to make this natural.

Handoffs stay within a single Runner.run() call. Input guardrails apply only to the first agent in the chain; output guardrails only to the agent producing the final output.

Guardrails: Safety at Every Layer

Guardrails are validation functions that run in parallel with agent execution. The SDK supports three scopes:

Guardrail TypeWhen It RunsWhat It Guards
@input_guardrailBefore the LLM processes user inputPrevents bad prompts from reaching the model
@output_guardrailBefore output reaches the userBlocks harmful or off-topic responses
Tool guardrailsBefore/after each custom function-tool callValidates tool inputs and outputs
from agents import input_guardrail, output_guardrail, GuardrailFunctionOutput

@input_guardrail
async def no_pii_guardrail(context, agent, input):
    if any(p in str(input).lower() for p in ["credit card", "ssn", "social security"]):
        return GuardrailFunctionOutput(tripwire_triggered=True)
    return GuardrailFunctionOutput(allow=True)

A guardrail with tripwire_triggered=True halts execution immediately. This is the SDK's primary safety mechanism and it's refreshingly simple — a guardrail is just a function that returns a boolean.

Sessions: Persistent Memory

Sessions maintain working context across agent turns. The SDK supports pluggable backends:

  • SQLite — development and single-process deployments
  • Redis — multi-process, multi-instance persistence
  • Postgres (via SQLAlchemy) — production deployments with HA requirements
from agents import Agent, Runner, SQLAlchemySession

session = await SQLAlchemySession.create(
    connection_string="sqlite:///agent_memory.db"
)

agent = Agent(name="Memory Agent", instructions="Remember user preferences.")

result = await Runner.run(agent, "My name is Alice.", session=session)
result = await Runner.run(agent, "What's my name?", session=session)
# → "Your name is Alice."

Sessions store conversation history, tool call results, and any application context you inject via RunContextWrapper. The SDK handles serialization and deserialization — you provide the connection string.

Tracing: Observability Built In

Every agent run is automatically traced. The SDK collects:

  • LLM generations (prompt, completion, token counts)
  • Tool calls (function name, input, output, duration)
  • Handoffs (source, target, reason)
  • Guardrail checks (which guardrail, result, duration)
  • Custom events you add with agents.trace()

Traces are sent to OpenAI's Traces dashboard by default. You can also export to your observability stack — Datadog, Grafana, or a custom processor via TraceProcessor.

from agents import trace

with trace(workflow_name="Customer Support"):
    result = Runner.run_sync(triage_agent, "I need help with billing.")

This is one of the SDK's strongest features. Most agent frameworks ship tracing as an add-on or leave it to third-party tools. The SDK makes it free and automatic — you get it from the first Runner.run_sync().

Multi-Agent Orchestration Patterns

The SDK supports three distinct multi-agent patterns:

Pattern 1: Router/Supervisor

A supervisor agent routes requests to specialist agents. This is the most common pattern and the one the SDK handles best:

User → Triage Agent → [Billing Agent, Support Agent, Sales Agent]

Pattern 2: Agent as Tool

One agent exposes another agent as a tool it can call. The parent agent decides when to invoke the child, and the child runs with its own instructions and tools:

@function_tool
def delegate_to_research(query: str) -> str:
    result = Runner.run_sync(research_agent, query)
    return result.final_output

coordinator = Agent(
    name="Coordinator",
    instructions="You coordinate research tasks.",
    tools=[delegate_to_research],
)

Pattern 3: Parallel Agents

Multiple agents work on independent subtasks, and a final agent synthesizes results. The SDK doesn't have built-in parallel execution — you manage this yourself with Python's asyncio.gather() or a task queue:

import asyncio
from agents import Agent, Runner

research_agent = Agent(name="Researcher", instructions="Research the topic.")
writing_agent = Agent(name="Writer", instructions="Write the draft.")
review_agent = Agent(name="Reviewer", instructions="Review and improve.")

async def run_parallel():
    research, draft, review = await asyncio.gather(
        Runner.run(research_agent, "Topic: LLM agents"),
        Runner.run(writing_agent, "Write about LLM agents"),
        Runner.run(review_agent, "Review this article"),
    )
    return synthesize(research.final_output, draft.final_output, review.final_output)

This works but lacks built-in coordination. If you need parallel agents with shared state, dependency ordering, and error propagation, LangGraph's graph model may be a better fit.

How It Compares

LangGraph (LangChain)

LangGraph models agent workflows as directed graphs with explicit state nodes, edges, and conditional transitions. It's the most powerful and most complex option.

DimensionLangGraphOpenAI Agents SDK
Core abstractionState graph with nodes and edgesAgent descriptor + Runner loop
Control flowExplicit via graph edgesImplicit via LLM+handoffs
State managementManual — you design the state schemaAutomatic via Runner + Sessions
Parallel executionBuilt-in (fan-out/fan-in via nodes)Manual (asyncio.gather)
Human-in-the-loopFirst-class (checkpoints, approval nodes)Via guardrails (limited)
ObservabilityLangSmith (comprehensive, paid tier)Built-in tracing (free with API key)
Learning curveSteep — graph design, state, routingGentle — agents, tools, handoffs
LLM support50+ providers via LangChainOpenAI + 100+ via LiteLLM

Pick LangGraph when: you need explicit control over execution flow, durable state for long-running workflows, sophisticated human-in-the-loop approval chains, or parallel agent execution with dependency management.

Pick OpenAI SDK when: you want a fast start with handoff-based multi-agent, built-in tracing is sufficient, your workflows are tree-shaped (router → specialist) rather than graph-shaped, and you're already on OpenAI infrastructure.

AutoGen (Microsoft)

AutoGen models multi-agent workflows as conversations between agents. An agent generates, another critiques, a third summarizes. The emergent behavior from conversation patterns is AutoGen's superpower.

DimensionAutoGenOpenAI Agents SDK
Core abstractionConversational agent groupsAgent + Runner
Multi-agent modelGroup chat, nested chat, two-agent chatHandoffs (agent calls agent)
Human-in-the-loopFirst-class — agent proposes, human approvesVia guardrails (basic)
Code executionBuilt-in sandboxVia SandboxAgent (beta)
Research backingPublished papers on conversation patternsProduction-oriented, less research
StabilityActive development with breaking changesProduction-focused, stable API
LLM supportOpenAI-focused via configOpenAI + 100+ via LiteLLM

Pick AutoGen when: your workflow benefits from emergent behavior through agent conversation — code review cycles where critic and author iterate, or research loops where agents challenge each other's assumptions. AutoGen's conversation patterns handle reciprocal direction changes naturally (agent A → agent B → agent A), which handoffs don't.

Pick OpenAI SDK when: you want deterministic handoffs between clearly defined specialists, your workflow is tree-shaped (one agent routes to one specialist), or you need built-in tracing without setting up a separate observability stack.

CrewAI

CrewAI models multi-agent systems as teams with roles. Agents have roles, goals, and backstories. Tasks have clear owners. This maps beautifully to how humans think about team workflows.

DimensionCrewAIOpenAI Agents SDK
Core abstractionRoles + Tasks + CrewAgent + Runner
Prototyping speedFastest — 30 lines for multi-agentFast — 50 lines for multi-agent
Tool integrationBasic, thinner docsRich (function_tool, MCP, hosted tools)
Execution modelSequential tasks per roleHandoff-driven, less structured
ObservabilityLimitedBuilt-in tracing
Production readinessApproaching — limited checkpointingHigh — tracing, guardrails, sessions
LLM supportOpenAI-focusedOpenAI + 100+ via LiteLLM

Pick CrewAI when: you need to prototype a multi-agent workflow in an afternoon, your agents have clear role-based responsibilities that map to sequential tasks (research → write → review), and you accept the gap between prototype and production.

Pick OpenAI SDK when: production reliability matters — guardrails, tracing, and session persistence are priorities; you need rich tool integration with MCP; or your multi-agent pattern is routing/supervisor rather than role-based pipelines.

Feature Comparison Matrix

FeatureOpenAI SDKLangGraphAutoGenCrewAI
Agent definitionPython classGraph nodeAgent classRole-based
Tool registration@function_tool, MCP, hostedTool decoratorTool classTool class
Multi-agent orchestrationHandoffsGraph edgesGroup chatSequential tasks
GuardrailsBuilt-in (3 scopes)Via callbacksLimitedNone
TracingBuilt-in, freeLangSmith (paid tiers)Third-partyLimited
Human-in-the-loopVia guardrailsFirst-class (checkpoints)First-classVia tasks
MCP supportNativeCommunity adaptersCommunity adaptersVia plugins
Parallel executionManual (asyncio)Built-in (fan-out)Via group chatSequential only
Sessions/memoryBuilt-in (SQLite/Redis/PG)Built-in (checkpointer)Via agentsLimited
LLM provider supportOpenAI + 100+ via LiteLLM50+ providersOpenAI-focusedOpenAI-focused
Production readinessHighHighestMediumApproaching
Learning curveLowHighMediumLow
LicenseMITMITMIT (AG2)MIT

Limitations

The SDK is not a silver bullet. Here's what it doesn't do well:

Vendor lock-in. The SDK is provider-agnostic on paper (it supports any LLM via LiteLLM), but in practice the best features — tracing dashboard, hosted tools, sandbox agents — are OpenAI-only. Non-OpenAI models lose the built-in hosted tools and may not get full tracing fidelity.

No parallel execution. The SDK runs one agent at a time. Handoffs are sequential. If you need parallel agents with result merging, you write the coordination yourself. LangGraph handles this natively with fan-out/fan-in graph nodes.

Limited human-in-the-loop. Guardrails can halt execution, but they can't pause and resume. AutoGen's model — where an agent proposes, a human approves, the agent continues — isn't achievable with the current SDK. You'd need to build a custom workflow around the guardrail tripwire.

OpenAI model pricing. Running the SDK with OpenAI models means paying per-token at standard API rates. For high-volume agent deployments with many turns per session, costs can add up quickly. The SDK doesn't include cost control or budgeting.

Sandbox agents are beta. The isolated workspace execution (Docker sandbox for code running) is in beta. The API and capabilities are expected to change before GA.

No TypeScript support yet. The SDK is Python-only. TypeScript is "planned for a future release."

Pitfalls

What I learned the hard way evaluating this SDK:

Max turns matters. The default max_turns=10 in Runner.run_sync() is easy to hit in multi-handoff workflows. Each handoff costs a turn. Three agents with two tools each can exhaust 10 turns before producing output. Either raise max_turns or set max_turns=None to disable the limit entirely (available in recent SDK versions).

Tracing costs nothing but requires an OpenAI API key. You can use tracing with non-OpenAI models — the SDK lets you set a separate tracing key via set_tracing_export_api_key(). But you need at least one valid OpenAI API key even if you're running on Anthropic or local models.

Handoff history management. By default, when agent A hands off to agent B, the conversation rewrites to give B a clean context. The SDK provides filters in agents.extensions.handoff_filters for common rewrite strategies. If you need the full conversation visible to all agents, you'll need a custom handoff filter.

Tool guardrails fire on every call. If you register a tool guardrail on a frequently-used function, it runs on every invocation. Keep guardrail logic cheap — no heavy model calls or external API requests in the guardrail path.

Session serialization can surprise you. Sessions persist everything including tool outputs. If your tools return large data (files, images, long search results), the session database grows fast. Set a retention policy or prune old sessions.

When to Use the OpenAI Agents SDK

Good Fit

  • You're building on OpenAI infrastructure. If your stack already uses OpenAI models and the Responses API, the SDK adds agent orchestration with minimal new surface area.
  • Your workflow is router-based. A supervisor agent that hands off to specialists is the SDK's best pattern. It handles this naturally and efficiently.
  • You need built-in guardrails and tracing. If these are compliance requirements, the SDK ships them ready to go. No assembly required.
  • You're migrating from the Assistants API. The August 2026 sunset makes this urgent. The SDK is the recommended migration path.

Bad Fit

  • You need parallel agents with coordinated results. LangGraph handles this better with graph-structured fan-out/fan-in.
  • Your workflow needs multi-step human approval. AutoGen's conversation model with embedded human review is a better fit.
  • You're on a tight token budget. The SDK abstracts away token tracking per step. You can inspect traces post-hoc, but there's no built-in cost governor.
  • You need TypeScript support. It's coming but not here yet.

The Bottom Line

The OpenAI Agents SDK is the most batteries-included agent framework on the market for OpenAI-native stacks. Guardrails, tracing, sessions, MCP integration, and handoff-based multi-agent orchestration ship in a single pip install. The SDK's minimal abstraction layer — agents, tools, handoffs, guardrails — lets you build production agent workflows without learning a complex framework API.

But the SDK is opinionated. Handoffs are the only multi-agent pattern. Parallel execution is your responsibility. Human-in-the-loop beyond simple tripwires requires custom work. And for all its provider-agnostic claims, the SDK's best features are OpenAI-only.

Reach for the OpenAI Agents SDK when: you're on OpenAI, your multi-agent pattern is router-to-specialist, and you want guardrails and tracing without configuring external tools.

Reach for LangGraph when: you need explicit state control, parallel execution, and durable checkpoints — and you can invest in learning the graph model.

Reach for AutoGen when: your value is in emergent conversation patterns — agents that critique, iterate, and converge through dialogue.

Reach for CrewAI when: you're prototyping and need a multi-agent system working in an afternoon. Then evaluate whether the pattern justifies a production framework investment.

For a practical setup guide covering installation, agent configuration, and common patterns, see the OpenAI Agents SDK Setup Guide. For the broader agent framework landscape including LangChain and smaller players, see AI Agent Frameworks Compared (2026).