Prompt Chaining

Prompt chaining decomposes complex tasks into sequential steps where each output feeds the next. Instead of one massive prompt hoping the model gets everything right, you break the work into focused subtasks — each with its own prompt, its own quality check, and a clear handoff to the next step.

It's the simplest multi-step pattern to implement: wire two LLM calls together, pass the output of call 1 as input to call 2. No planning agents, no tool loop, no autonomous decisions. Just structured decomposition.

Note:

If you're new to multi-step prompting, start here. Chaining is easier to build and debug than agentic prompting or RAG pipelines. Graduate to those when you need dynamic decision-making or external retrieval.

Sequential Chaining

Break a task into ordered steps. Each step processes the previous output and hands it forward. If step 3 fails, you retry step 3 — not steps 1 and 2.

Basic Pattern

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Step 1  │───→│ Step 2  │───→│ Step 3  │───→│  Final  │
│  Plan   │    │  Draft  │    │  Polish │    │  Output │
└─────────┘    └─────────┘    └─────────┘    └─────────┘

Step 1 — Plan

Create a detailed outline for a blog post about Kubernetes for beginners.
Include 4-5 H2 sections with 2-3 bullet points of key topics under each.

Output format:
## Section Title
- Key point 1
- Key point 2

Step 2 — Draft

Expand each section of this outline into 2-3 paragraphs. Write in a clear,
conversational tone suitable for beginners. Include concrete examples.

Outline:
{outline_from_step_1}

Previous decisions: The post targets complete beginners with no DevOps
experience. Tone is friendly, not academic.

Step 3 — Polish

Refine this draft for clarity, grammar, and accuracy. Add one concrete
example or analogy per section. Remove any jargon without explanation.

Draft:
{draft_from_step_2}

Original goal: Beginner-friendly Kubernetes guide.
Sections already covered: {already_covered}

Wiring It in Code

from openai import OpenAI

client = OpenAI()

def chain(question: str) -> str:
    # Step 1: Plan
    plan = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        messages=[{"role": "user", "content": f"Create a detailed outline for: {question}"}]
    )
    outline = plan.choices[0].message.content

    # Gate check: fail fast if the outline is empty or too short
    if not outline or len(outline) < 50:
        raise ValueError("Step 1 produced insufficient output")

    # Step 2: Draft
    draft = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.7,
        messages=[{"role": "user", "content": f"""Expand this outline into full paragraphs.

Outline:
{outline}

Write in a clear, conversational tone."""}]
    )
    text = draft.choices[0].message.content

    # Step 3: Polish
    final = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        messages=[{"role": "user", "content": f"""Refine this draft. Fix grammar, improve clarity, add
one concrete example per section.

Draft:
{text}"""}]
    )
    return final.choices[0].message.content

Key details in the code:

Gate checks between steps catch failures early. If step 1 produces garbage, don't waste tokens on steps 2 and 3.
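Temperature varies by step. Plan and polish use 0.3 for more consistent output (a low temperature reduces randomness but does not guarantee determinism). Draft uses 0.7 for more varied phrasing.
Context is carried forward. Each step receives the previous step's output. The prompt templates above also restate earlier decisions and the original goal; the code omits those lines for brevity.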
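The same pattern can be written once as a small runner instead of being repeated for every step. The following is a minimal sketch, not a definitive implementation: Step, run_chain, and call_llm are hypothetical names, and call_llm stands in for whatever function sends a prompt to your model and returns its text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    build_prompt: Callable[[str], str]  # turns the previous output into this step's prompt
    min_chars: int = 1                  # gate: minimum acceptable output length


def run_chain(steps: list[Step], initial_input: str, call_llm: Callable[[str], str]) -> str:
    """Run steps in order, gating each output before handing it forward."""
    current = initial_input
    for step in steps:
        output = call_llm(step.build_prompt(current))
        # Gate check: stop before spending tokens on later steps
        if not output or len(output.strip()) < step.min_chars:
            raise ValueError(f"Step '{step.name}' produced insufficient output")
        current = output
    return current


if __name__ == "__main__":
    # A fake model keeps the example runnable without an API key
    def fake_llm(prompt: str) -> str:
        return prompt.upper()

    steps = [
        Step("plan", lambda topic: f"Create an outline for: {topic}", min_chars=10),
        Step("draft", lambda outline: f"Expand this outline:\n{outline}", min_chars=10),
    ]
    print(run_chain(steps, "Kubernetes for beginners", fake_llm))
```

Because the model call is injected, each step can be tested with a fake function before any real API call is wired in.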

Routing

Classify the input into a category, then dispatch to a specialized handler prompt optimized for that category.

Classify the user query into exactly one category. Output only the category
name and a confidence score from 0.0 to 1.0.

Categories:
- pricing: Questions about cost, plans, billing
- technical: Setup, configuration, troubleshooting
- account: Login, permissions, profile management
- feedback: Feature requests, complaints, suggestions
- general: Everything not covered above

Query: {user_query}

Category:
Confidence:

If confidence is below 0.7, route to a disambiguation prompt before the handler:

I want to make sure I understand your question correctly. Did you mean:

A) How to set up {topic} from scratch
B) How to fix an existing {topic} configuration
C) How to compare {topic} with alternatives
D) Something else — please clarify

Reply with just the letter.

Each handler has its own prompt:

You are a technical support specialist for {product_name}.
The user needs help with setup or configuration.

Rules:
- Give step-by-step instructions
- Include copy-pasteable commands where relevant
- Ask clarifying questions if the OS or version is not specified
- Point to official docs for advanced options

User question: {user_query}

Wiring routing in code:

def route_and_respond(query: str) -> str:
    # Step 1: Classify
    category, confidence = classify_query(client, query)

    # Step 2: Disambiguate if unsure
    if confidence < 0.7:
        query = disambiguate(client, query, category)

    # Step 3: Dispatch to handler
    handler = HANDLERS.get(category, HANDLERS["general"])
    return handler(client, query)


HANDLERS = {
    "pricing": respond_pricing,
    "technical": respond_technical,
    "account": respond_account,
    "feedback": respond_feedback,
    "general": respond_general,
}

A catch-all general handler prevents routing dead ends. If no category matches, the system still responds.
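The routing code leaves classify_query undefined. One piece it would likely need is a parser for the classifier's "Category:" and "Confidence:" lines. The sketch below shows one way to do that; parse_classification and VALID_CATEGORIES are hypothetical names, and returning ("general", 0.0) for unparseable output is an assumption chosen so that bad classifier output falls through to disambiguation rather than crashing.

```python
import re

# Assumed to mirror the categories listed in the classifier prompt
VALID_CATEGORIES = {"pricing", "technical", "account", "feedback", "general"}


def parse_classification(raw: str) -> tuple[str, float]:
    """Parse 'Category: x' and 'Confidence: y' lines from the classifier's reply."""
    cat_match = re.search(r"category:[ \t]*([a-z_]+)", raw, re.IGNORECASE)
    conf_match = re.search(r"confidence:[ \t]*([0-9]*\.?[0-9]+)", raw, re.IGNORECASE)

    # Unknown or missing category: fall back to the catch-all with zero confidence
    if cat_match is None or cat_match.group(1).lower() not in VALID_CATEGORIES:
        return "general", 0.0

    confidence = float(conf_match.group(1)) if conf_match else 0.0
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to the 0.0-1.0 range
    return cat_match.group(1).lower(), confidence


if __name__ == "__main__":
    print(parse_classification("Category: technical\nConfidence: 0.85"))
    print(parse_classification("I am not sure what this is"))
```

Treating a missing confidence score as 0.0 means a malformed reply is disambiguated instead of being dispatched with false certainty.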

Parallelization

Run independent subtasks concurrently, then combine results through an aggregator step.

You are researching {topic} from {n} different angles. Your assigned
perspective is: {perspective_name}

Rules:
- Research only your assigned angle
- Provide specific data points with sources
- Do not reference other perspectives — the aggregator handles that
- Format findings as structured bullet points

Research focus: {focus_area}

Run this prompt N times simultaneously — one per perspective. Then aggregate:

Synthesize these {n} research reports into a unified analysis.

{p1}: {report_1}
{p2}: {report_2}
{p3}: {report_3}

Output:
1. Comparison table showing key differences across perspectives
2. Areas of agreement (what all perspectives concur on)
3. Contradictions or conflicting findings (flag explicitly, don't smooth over)
4. Key takeaways as a prioritized list

Wiring parallel execution:

import asyncio

async def parallel_research(topic: str, perspectives: list[tuple[str, str]]) -> str:
    tasks = [
        asyncio.create_task(
            research_perspective(client, topic, name, focus)
        )
        for name, focus in perspectives
    ]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Filter out failed subtasks — don't block on one failure
    completed = [(p[0], r) for (p, r) in zip(perspectives, results)
                 if not isinstance(r, Exception)]

    return aggregate_findings(client, topic, completed)

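return_exceptions=True makes asyncio.gather return exceptions as results instead of raising the first one, so one failed subtask does not kill the entire pipeline. This assumes research_perspective is a coroutine (for example, one built on the SDK's AsyncOpenAI client); a blocking call inside it would run the requests one after another. Note that the code above silently drops failed perspectives. If the aggregator should report which perspectives are missing, pass their names to it as well.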
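To let the aggregator see which perspectives failed, the results can be split into completed and missing before the prompt is built. This is a sketch under stated assumptions: gather_reports and build_aggregator_prompt are hypothetical helpers, and the prompt wording is illustrative rather than taken from a tested template.

```python
import asyncio
from collections.abc import Awaitable


async def gather_reports(
    names: list[str], coroutines: list[Awaitable[str]]
) -> tuple[dict[str, str], list[str]]:
    """Await all subtasks; return (completed reports by name, names that failed)."""
    results = await asyncio.gather(*coroutines, return_exceptions=True)
    completed: dict[str, str] = {}
    missing: list[str] = []
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            missing.append(name)
        else:
            completed[name] = result
    return completed, missing


def build_aggregator_prompt(topic: str, completed: dict[str, str], missing: list[str]) -> str:
    """Build the synthesis prompt, naming any perspectives that did not complete."""
    reports = "\n\n".join(f"{name}: {report}" for name, report in completed.items())
    missing_note = ", ".join(missing) if missing else "none"
    return (
        f"Synthesize these {len(completed)} research reports on {topic} "
        "into a unified analysis.\n\n"
        f"{reports}\n\n"
        f"Perspectives that failed and are missing from the input: {missing_note}\n"
        "Flag contradictions explicitly and note any gaps left by missing perspectives."
    )


if __name__ == "__main__":
    async def ok(text: str) -> str:
        return text

    async def fail() -> str:
        raise RuntimeError("simulated timeout")

    done, lost = asyncio.run(gather_reports(["cost", "security"], [ok("Cheap to run."), fail()]))
    print(build_aggregator_prompt("Kubernetes", done, lost))
```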

Chaining vs Agentic Prompting

Both handle multi-step tasks. Here's when to pick each:

Pattern	Best For	Decision Logic	Recovery
Chaining	Fixed workflow, known steps ahead of time	Hardcoded — step 1 always calls step 2	Retry the failing step
Agentic	Dynamic tasks, uncertain number of steps	LLM decides next action at runtime	Re-plan from current state

Use chaining when you know the pipeline structure in advance. Use agentic prompting when the model needs to decide what to do next.

Chaining is usually simpler, faster, and cheaper because the number of LLM calls is fixed in advance. Agentic prompting is more flexible but typically costs more tokens and adds latency.

Combining with Other Techniques

Chaining composes well with other prompting patterns:

Chain + CoT. Each step in the chain uses chain-of-thought internally:

Step 2: Draft each section.
[outline from Step 1]

For each section, think through:
- What's the one concept the reader needs to understand?
- What analogy would make this click?
- What order should the paragraphs go in?

Then write the section.

Chain + RAG. Retrieve documents before a chain step that needs external knowledge:

# Before Step 2 (draft), retrieve relevant docs.
# retrieve() is a placeholder for your own retrieval function.
docs = retrieve(question=outline, top_k=3)
draft = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Use these sources:\n{docs}\n\nExpand outline:\n{outline}"
    }]
)

Chain + Tool Use. A step calls an external tool, then the next step processes the result:

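# Step 1: Generate a SQL query
response = client.chat.completions.create(...)
sql = response.choices[0].message.content  # extract the query text from the response

# Tool: Execute the query (validate or sandbox model-generated SQL first)
results = db.execute(sql)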

# Step 2: Format results into plain English
explanation = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Results: {results}\n\nSummarize these findings for a non-technical audience."
    }]
)

Common Failure Modes

Information loss between steps. Each LLM call may drop or distort details from the previous step. Mitigation: include a summary field alongside the raw output in each handoff.

[Full output from Step 1: {output}]
[Summary: The outline has 5 sections covering intro, core concepts,
hands-on tutorial, common pitfalls, and next steps.]
[Current step: Draft section 3 — "Hands-on tutorial"]
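The handoff format above can be generated from a small structure so that every step receives the same fields. This is a minimal sketch: Handoff and its field names are hypothetical, and the goal field is an assumption added so that the original goal travels with each handoff.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    goal: str             # original task, restated at every step
    previous_output: str  # full raw output from the previous step
    summary: str          # short summary of what that output contains
    current_step: str     # what this step is expected to do

    def render(self) -> str:
        """Format the handoff as the bracketed block prepended to a step's prompt."""
        return (
            f"[Original goal: {self.goal}]\n"
            f"[Full output from previous step: {self.previous_output}]\n"
            f"[Summary: {self.summary}]\n"
            f"[Current step: {self.current_step}]"
        )


if __name__ == "__main__":
    handoff = Handoff(
        goal="Beginner-friendly Kubernetes guide",
        previous_output="## Intro\n- What Kubernetes is",
        summary="The outline has 5 sections.",
        current_step='Draft section 3: "Hands-on tutorial"',
    )
    print(handoff.render())
```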

Compounding errors. An error in step 2 gets magnified in steps 3 and 4. Mitigation: add validation gates between steps.

if not draft or len(draft) < 200:
    # "draft or ''" avoids a TypeError in the message when draft is None
    raise ValueError(f"Step 2 produced insufficient output: {len(draft or '')} chars")

Context drift. Long chains lose focus as each step subtly shifts the task. Mitigation: re-state the original goal in each step's prompt:

Remember, the end goal is a beginner-friendly Kubernetes guide.
Keep everything at that level. Don't assume prior DevOps knowledge.

Routing dead ends. No category matches and the system has no fallback. Mitigation: always include a general catch-all handler.

Parallel task conflicts. Two subtasks produce contradictory findings. Mitigation: instruct the aggregator to flag contradictions explicitly rather than smoothing them over.

Troubleshooting

Symptom	Likely Cause	Fix
Chain output too short	Step instructions too vague	Add length requirement in that step's prompt
Chain output too long	No length constraint	Specify max words or section count per step
Router picks wrong category	Categories too broad or overlapping	Make categories mutually exclusive; test with edge cases
Parallel results conflict	Subtasks share hidden dependencies	Check subtask boundaries; reduce overlap in research focus
Quality degrades after step 3	Context drift from original goal	Re-state the original task description in every step's prompt, not only the first
Chain fails silently on step N	No gate check before step N+1	Add validation between every step
High latency from unnecessary steps	Every query runs the full chain	Route simple queries to a single-step handler

Best Practices

Validate at each step. Check output meets minimum requirements before passing forward.
Keep steps focused. Each step does one transformation. If a step does two things, split it.
Set max iterations. For chains with retry logic, cap at 3 retries per step to avoid loops.
Carry context forward. Include the output, a summary, and what was already decided.
Handle failures gracefully. A failed step shouldn't crash the pipeline — log it, retry, or skip.
Design for testability. Each step is an independent function you can test with mock data.
Start with a 2-step chain. Don't build a 7-step pipeline on day one. Get a 2-step chain working, then add steps.
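The validation and retry-cap practices can be combined in one wrapper around a step. The sketch below is one possible shape, not a prescribed one: run_with_retries, validate, and backoff_seconds are hypothetical names, and the linear backoff is an assumption.

```python
import time
from typing import Callable


def run_with_retries(
    step: Callable[[], str],
    validate: Callable[[str], bool],
    max_retries: int = 3,
    backoff_seconds: float = 0.0,
) -> str:
    """Call step() until validate(output) passes, allowing at most max_retries retries."""
    last_error: Exception = ValueError("step was never run")
    for attempt in range(1 + max_retries):  # one initial attempt plus the retries
        try:
            output = step()
            if validate(output):
                return output
            last_error = ValueError("output failed validation")
        except Exception as exc:  # broad on purpose for a sketch; narrow this in real code
            last_error = exc
        if attempt < max_retries:
            time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff
    raise RuntimeError(f"Step failed after {max_retries} retries") from last_error


if __name__ == "__main__":
    attempts = []

    def flaky_step() -> str:
        # Simulates a step that returns empty output twice, then succeeds
        attempts.append(1)
        return "" if len(attempts) < 3 else "A usable draft."

    print(run_with_retries(flaky_step, lambda out: len(out) > 5))
```

Only the failing step is re-run, so earlier steps in the chain are not repeated.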

Note:

Pro tip: The simplest chain that solves the problem is the best chain. A 2-step pipeline that works reliably beats a 7-step pipeline that breaks on edge cases.
