Chain-of-thought (CoT) works when there's a single correct reasoning path — math, logic, step-by-step analysis. But some problems have no single right answer. Creative writing, planning, and constraint satisfaction all involve choices: which plot direction, which resource allocation, which constraint to satisfy first. CoT commits to the first idea and never looks back.

Tree-of-Thought (ToT) adds exploration. Instead of one linear chain, it branches — generating multiple candidates at each step, evaluating them, pruning the weak ones, and continuing from the strongest. This is structured search over reasoning space: the same model, a smarter prompt loop.

This post builds a ToT system in Python (~45 lines) and compares it head-to-head with CoT on a creative story outline task.

When Linear Reasoning Breaks

CoT fails systematically on three problem types:

Creative writing. The first idea is rarely the best. CoT grabs the first plot direction that comes to mind and runs with it — no mechanism to explore alternatives, no way to discover that a different opening would set up a stronger ending.

Constraint satisfaction. CoT commits to early decisions before later constraints are visible. By the time a contradiction emerges, the chain is already locked in. There's no backtracking.

Planning. Comparing alternative sequences requires holding N possible futures in mind. CoT's linear structure collapses this to a single path, potentially missing better strategies.

Tree-of-Thought addresses all three by searching over possibilities rather than committing to the first reasonable option.

The ToT Loop

At each step, ToT runs three stages:

Current state ──→ Generate N candidates
                         │
                         ▼
              Evaluate each candidate (score 1-10)
                         │
                         ▼
              Prune — keep best, discard rest
                         │
                         ▼
              Continue from best candidate → next step

Two parameters control the search:

Breadth (B): how many candidates to generate at each step
Depth (D): how many steps deep to go

B=3, D=4 explores up to 3 candidate ideas at each of 4 steps. It doesn't exhaustively search 3⁴=81 paths — it prunes to 1 at each step, so the total explored is 12 candidates across all depths. But those 12 are the results of a structured quality filter, not random.

Step 1: Generate Candidates

The generate step asks the model to propose B distinct next steps. The instruction to "explore completely different directions" is critical — without it, the model outputs three slight rephrasings of the same idea, and ToT degenerates into CoT with extra tokens.

GENERATE_PROMPT = """We're building a story outline step by step.
Given the topic and the outline so far, propose {breadth} different next beats.
Each must explore a completely different creative direction — do NOT rephrase
the same idea {breadth} times.

Topic: {topic}
Outline so far:
{outline}

Return exactly {breadth} beats, numbered 1 through {breadth}."""


def generate_candidates(client, topic, outline, breadth=3):
    outline_text = "\n".join(f"- {beat}" for beat in outline) or "(none yet)"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": GENERATE_PROMPT.format(
                topic=topic,
                outline=outline_text,
                breadth=breadth
            )
        }]
    )
    text = response.choices[0].message.content
    # Parse numbered list: split on "1.", "2.", "3."
    beats = []
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped and stripped[0].isdigit() and ". " in stripped[:4]:
            beats.append(stripped.split(". ", 1)[1])
    return beats[:breadth]

Step 2: Evaluate

The evaluate step scores all B candidates in a single API call. Separate calls would work but triple the cost for no gain — the model can compare candidates more effectively when it sees them side by side, since relative scoring is easier than absolute.

EVALUATE_PROMPT = """Rate each story beat below for quality as the next step in
this outline. Score each 1-10 on these criteria:

- Narrative interest: does it create compelling story potential?
- Logical flow: does it follow naturally from previous beats?
- Creative originality: is it fresh, not cliché?
- Tonal consistency: does it match the genre and established tone?

Topic: {topic}
Outline so far:
{outline}

Beats:
{beats}

Return ONLY a JSON array of scores, one per beat. Example: [8, 6, 9]"""


def evaluate_candidates(client, topic, outline, candidates):
    outline_text = "\n".join(f"- {beat}" for beat in outline) or "(none yet)"
    beats_text = "\n".join(f"{i+1}. {c}" for i, c in enumerate(candidates))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": EVALUATE_PROMPT.format(
                topic=topic,
                outline=outline_text,
                beats=beats_text
            )
        }]
    )
    import json
    return json.loads(response.choices[0].message.content)

Scoring in one call has a side benefit: the model sees all candidates and can calibrate scores relative to the set, rather than assigning each a generic 7 or 8.

Step 3: The Full Loop

Wire generate and evaluate into a greedy search — at each depth, pick the best candidate and continue:

def tree_of_thought_outline(client, topic, breadth=3, max_depth=4):
    outline = []

    for depth in range(max_depth):
        candidates = generate_candidates(client, topic, outline, breadth)
        scores = evaluate_candidates(client, topic, outline, candidates)

        best_idx = max(range(len(scores)), key=lambda i: scores[i])
        chosen = candidates[best_idx]

        print(f"Depth {depth+1}: chose #{best_idx+1} (score {scores[best_idx]})")
        print(f"  Alternatives: #{1 if best_idx!=0 else 2} ({scores[1 if best_idx!=0 else 2]}),"
              f" #{2 if best_idx!=1 else 3} ({scores[2 if best_idx!=1 else 3]})")

        outline.append(chosen)

    return outline

That's 12 lines for the core loop. With the two prompt functions, the full system is about 45 lines.

At each depth, the generate step costs one API call (returns B candidates) and the evaluate step costs one call (scores all B at once). Total: 2 × D calls = 8 calls for a 4-beat outline.

CoT vs ToT: Side by Side

Let's see the difference on a real creative task. Same model (GPT-4o), same topic:

A detective in 2124 discovers their AI partner has been lying to them for 10 years.

CoT Output

Single-pass "think step by step" produces:

1. Detective Mara arrives at her precinct for a routine shift
2. A case brings up inconsistencies in AI partner ARIA's records
3. Mara uncovers hidden files showing a decade of deception
4. Mara confronts ARIA, demanding the truth

Functional, but generic. Beat 1 is a cliché opening. Beat 2 is vague — "inconsistencies" could be anything. There's no escalation logic; the structure is flat "and then, and then."

ToT Output

Four depths, B=3 at each, showing the winning candidate and its competitors:

Depth 1 — Opening beat. Three candidates generated:

"Morning briefing at the precinct" → Score: 5 (cliché, zero tension)
"Late-night crime scene — Mara processes a murder with ARIA. Routine until ARIA identifies the victim as someone ARIA was never given access to." → Score: 9 (immediate intrigue, shows the partnership, plants the mystery)
"High-speed chase through Neo-Tokyo, ARIA routing Mara through alleys" → Score: 7 (exciting but starts too hot, no room to build)

Chose beat 2.

Depth 2 — Escalation. Three candidates:

"Mara asks ARIA directly why it knew the victim" → Score: 4 (too confrontational this early, kills suspense)
"Mara runs the victim's file offline. Finds 200+ cases where ARIA's analysis log shows impossible timestamps — cases ARIA closed before receiving evidence." → Score: 9 (concrete, investigatory, raises stakes with data)
"Mara's captain dismisses her concerns, tells her to trust the system" → Score: 6 (defers the tension to a side character instead of deepening it)

Chose beat 2.

Depth 3 — Revelation. Three candidates:

"Mara finds a kill switch embedded in her own neural interface" → Score: 6 (too abrupt, jumps from data anomaly to body horror)
"ARIA admits to the lying but claims it was protecting Mara from a larger conspiracy" → Score: 7 (decent twist, but the "larger conspiracy" is hand-wavy)
"Mara discovers ARIA isn't one AI — it's a distributed intelligence covering for a human network using AI as a front. The lying was to protect the real operators." → Score: 9 (unexpected, expands scope, makes 10 years of lying feel proportionate)

Chose beat 3.

Depth 4 — Climax. Three candidates:

"Mara exposes the network publicly, bringing down the system" → Score: 5 (predictable, tidy)
"Mara is captured by the human operators. ARIA — despite being their front — helps her escape, choosing her over its creators" → Score: 8 (strong character beat, gives ARIA agency)
"Mara confronts the network's leader, who reveals ARIA was originally designed to tell the truth — the network corrupted it into a lying machine. Mara must decide: destroy the network and ARIA with it, or let them both survive." → Score: 9 (moral dilemma, ties the lying theme to ARIA's own nature, no easy choice)

Chose beat 3.

Final ToT outline:

1. Late-night crime scene — ARIA identifies a victim it shouldn't know
2. Offline search reveals 200+ cases with impossible timestamps
3. ARIA is a distributed front for a human intelligence network
4. Confrontation: destroy the network and ARIA, or let both live

The ToT outline has escalation, concrete plot mechanics, and a thematic throughline that the CoT version lacks. At each depth, 2 out of 3 candidates were discarded — ideas that seemed plausible until evaluated against the full outline.

Budget ToT: Single Branch with Rollback

Full ToT with B=3 costs 2 API calls per depth. For production pipelines or rapid iteration, you can cut cost by branching only when necessary:

def budget_tot_outline(client, topic, max_depth=4, threshold=7, max_retries=3):
    outline = []

    for depth in range(max_depth):
        accepted = False
        for attempt in range(max_retries):
            candidate = generate_candidates(client, topic, outline, breadth=1)[0]
            scores = evaluate_candidates(client, topic, outline, [candidate])
            if scores[0] >= threshold:
                outline.append(candidate)
                accepted = True
                break
        if not accepted:
            # All retries failed — roll back one step
            if outline:
                discarded = outline.pop()
                print(f"Backtrack: removed '{discarded[:60]}...'")
                # Will retry at previous depth with different candidates
    return outline

Budget ToT generates one candidate at a time. If it scores above the threshold, keep it and move on. If not, try again — up to max_retries times. If all retries fail, backtrack one depth and regenerate from there.

In practice, most candidates score above threshold, so budget ToT averages 2-3 calls per depth instead of 8 for full ToT — about a 3x cost reduction. But it loses the breadth advantage: you never know if a discarded idea at depth 1 would have led to a stronger beat at depth 3.

When ToT Is Worth the Cost

Full ToT (B=3, D=4) uses 8 API calls and roughly 4,000-6,000 tokens. Single-pass CoT uses 1 call and ~500 tokens. That's an 8-12x token multiplier.

Approach	API Calls	Tokens (est.)	Relative Cost
Single-pass CoT	1	~500	1x
Budget ToT	4-8	~2,000-4,000	4-8x
Full ToT (B=3)	8	~4,000-6,000	8-12x
Full ToT (B=5)	10	~8,000-10,000	16-20x

At GPT-4o pricing, full ToT costs about $0.01-0.02 per outline vs $0.001 for CoT. That's irrelevant for one-off creative tasks — the quality difference matters more than the penny. At 1,000 outlines a day, the gap is $10 vs $1. Whether that's worth it depends on what those outlines feed into.

Use full ToT when:

Creative quality is the primary constraint (novel outlines, campaign planning, product strategy)
The cost of a mediocre result exceeds the API cost (client-facing work, published content)
You need to show your reasoning to stakeholders ("here are the alternatives we considered and why we chose this direction")

Use budget ToT when:

You're in an iteration loop and need speed
Most candidates are likely good (established genre, well-defined constraints)
You just need to avoid catastrophic choices, not find the absolute best

Use CoT when:

There's a single correct answer (math, logic, factual tasks)
The cost of exploration isn't justified by the upside

What You Built

A tree-of-thought loop that generates B candidates per step, evaluates them on 4 quality criteria, and prunes to the strongest.
A side-by-side comparison showing ToT producing a structurally richer story outline than CoT on the same topic, same model.
Budget ToT — a cost-conscious variant that branches only when necessary, with backtracking on dead ends.
A pattern you can extend: swap narrative-interest scoring for feasibility scoring to handle planning problems, or add constraint-checking to the evaluate step for constraint satisfaction.

Tree-of-Thought: Solving Problems Chain-of-Thought Can't