The Reasoning Spectrum

Prompting techniques form a progression in expressive power:

Chain-of-Thought (CoT) → linear, one path
  ↓
Tree-of-Thought (ToT)  → branching, evaluate and prune
  ↓
Graph-of-Thought (GoT) → merging, backtracking, combining partial solutions

Graph-of-Thought (Besta et al. 2023) models reasoning as a directed graph where thoughts are nodes and dependencies are edges. This enables operations that linear chains and trees cannot express.

Core Operations

Operation	What It Does	When You Need It
Generate	Create a new thought from existing ones	Standard step forward
Aggregate	Merge multiple thoughts into one	Combining partial results
Refine	Improve a thought based on feedback	Iterative improvement
Backtrack	Return to an earlier thought and try a different path	Dead-end recovery
Score	Evaluate thought quality for pruning	Quality control

When GoT Beats ToT and CoT

GoT wins when partial solutions can be combined:

Sorting problems: Breaking a list into sublists, sorting each, merging results. Sorting 64 numbers: 32-way split → sort individually → merge.
Document synthesis: Multiple documents each cover part of a topic. Generate summaries → aggregate into unified report.
Parallel sub-problems: A problem decomposes into independent sub-problems whose solutions combine additively.

ToT wins when branching is the primary need:

Creative writing with multiple candidate outlines
Game playing with multiple move sequences
Planning with a clear evaluation function

CoT wins when the reasoning is linear:

Simple arithmetic
Step-by-step instructions
Sequential deduction

Implementation Sketch

class GraphOfThought:
    def __init__(self, llm, scorer):
        self.llm = llm
        self.scorer = scorer  # function(thought) → score
        self.nodes = {}  # id → {content, score, parents, children}

    def generate(self, prompt, parent_ids=None):
        """Generate a new thought from parent thoughts."""
        context = prompt
        if parent_ids:
            parent_content = [self.nodes[pid]["content"] for pid in parent_ids]
            context = f"Based on: {'; '.join(parent_content)}\n\n{prompt}"
        response = self.llm.generate(context)
        node_id = len(self.nodes)
        self.nodes[node_id] = {
            "content": response,
            "score": self.scorer(response),
            "parents": parent_ids or [],
            "children": []
        }
        for pid in (parent_ids or []):
            self.nodes[pid]["children"].append(node_id)
        return node_id

    def aggregate(self, node_ids, prompt):
        """Merge multiple thoughts into one."""
        contents = [self.nodes[nid]["content"] for nid in node_ids]
        merge_prompt = f"Combine these into a single coherent answer:\n\n"
        merge_prompt += "\n---\n".join(contents)
        merge_prompt += f"\n\n{prompt}"
        return self.generate(merge_prompt, parent_ids=node_ids)

    def backtrack(self, node_id):
        """Return to a node's parents, removing the failed path."""
        self.nodes[node_id]["score"] = -1  # mark as dead
        return self.nodes[node_id]["parents"]

# Example: combining three research perspectives
got = GraphOfThought(llm, scorer)
n1 = got.generate("Research the carbon impact of AI training from an energy perspective")
n2 = got.generate("Research the carbon impact of AI training from a hardware perspective")
n3 = got.generate("Research the carbon impact of AI training from a data center perspective")
result = got.aggregate([n1, n2, n3], "Synthesize a comprehensive carbon impact report")

The Scoring Problem

Every graph operation depends on scoring. Getting scores wrong means pruning good ideas or keeping bad ones.

Scoring approaches:

LLM-as-judge: Ask the model to rate thoughts 1-5 on relevance and quality. Consistent but expensive.
Heuristic scoring: Length, keyword presence, format compliance. Fast but shallow.
Reference-based: Compare to a known-good answer. Gold standard but requires ground truth.
Ensemble score: Average multiple LLM ratings for more reliable scoring.

Graph vs Tree: Cost Comparison

For a problem with branching factor 3 and depth 3:

Method	LLM Calls	Tokens	Best For
CoT	1	100%	Linear reasoning
ToT	3 × 3 = 9 (worst case)	~900%	Branch-evaluate-prune
GoT	3 + 1 = 4 (merge pattern)	~400%	Parallel + combine

GoT is cheaper than ToT for parallel decomposable problems because you skip the evaluation and pruning overhead on all-but-one branch.

Practical Limitations

Scoring is hard. The quality of aggregation depends entirely on the scorer's accuracy. Bad scores corrupt the whole graph.
Not production-ready. GoT is a research concept. Libraries like DSPy support it experimentally, but no production frameworks ship it.
Overhead for simple tasks. If CoT works, don't add graph complexity. Only use GoT when merging intermediate results demonstrably improves output.
Latency. Each node is an API call. A graph with 10 nodes is 10+ sequential or parallel calls. Budget accordingly.

Algorithm Design Prompts for ChatGPT | Problem Solving Guide

Learn how to write effective prompts for algorithm design tasks, from problem analysis to implementation strategies.

World Building Guide for Fiction Writers

Master the art of world building with ChatGPT prompts designed to help you create rich, immersive fictional universes with depth and consistency.

Exam Preparation Guide: ChatGPT Prompts for Success

Master exam preparation with these ChatGPT prompts designed to help you study effectively, create study schedules, and perform well in academic assessments.

Graph-of-Thought: Beyond Trees and Chains