Graph-of-Thought: Beyond Trees and Chains
Model reasoning as a directed graph with merging, branching, and backtracking. Take GoT beyond Chain-of-Thought and Tree-of-Thought for complex multi-source synthesis.
The Reasoning Spectrum
Prompting techniques form a progression in expressive power:
Chain-of-Thought (CoT) → linear, one path
↓
Tree-of-Thought (ToT) → branching, evaluate and prune
↓
Graph-of-Thought (GoT) → merging, backtracking, combining partial solutions
Graph-of-Thought (Besta et al. 2023) models reasoning as a directed graph where thoughts are nodes and dependencies are edges. This enables operations that linear chains and trees cannot express.
Core Operations
| Operation | What It Does | When You Need It |
|---|---|---|
| Generate | Create a new thought from existing ones | Standard step forward |
| Aggregate | Merge multiple thoughts into one | Combining partial results |
| Refine | Improve a thought based on feedback | Iterative improvement |
| Backtrack | Return to an earlier thought and try a different path | Dead-end recovery |
| Score | Evaluate thought quality for pruning | Quality control |
When GoT Beats ToT and CoT
GoT wins when partial solutions can be combined:
- Sorting problems: Breaking a list into sublists, sorting each, merging results. Sorting 64 numbers: 32-way split → sort individually → merge.
- Document synthesis: Multiple documents each cover part of a topic. Generate summaries → aggregate into unified report.
- Parallel sub-problems: A problem decomposes into independent sub-problems whose solutions combine additively.
ToT wins when branching is the primary need:
- Creative writing with multiple candidate outlines
- Game playing with multiple move sequences
- Planning with a clear evaluation function
CoT wins when the reasoning is linear:
- Simple arithmetic
- Step-by-step instructions
- Sequential deduction
Implementation Sketch
class GraphOfThought:
def __init__(self, llm, scorer):
self.llm = llm
self.scorer = scorer # function(thought) → score
self.nodes = {} # id → {content, score, parents, children}
def generate(self, prompt, parent_ids=None):
"""Generate a new thought from parent thoughts."""
context = prompt
if parent_ids:
parent_content = [self.nodes[pid]["content"] for pid in parent_ids]
context = f"Based on: {'; '.join(parent_content)}\n\n{prompt}"
response = self.llm.generate(context)
node_id = len(self.nodes)
self.nodes[node_id] = {
"content": response,
"score": self.scorer(response),
"parents": parent_ids or [],
"children": []
}
for pid in (parent_ids or []):
self.nodes[pid]["children"].append(node_id)
return node_id
def aggregate(self, node_ids, prompt):
"""Merge multiple thoughts into one."""
contents = [self.nodes[nid]["content"] for nid in node_ids]
merge_prompt = f"Combine these into a single coherent answer:\n\n"
merge_prompt += "\n---\n".join(contents)
merge_prompt += f"\n\n{prompt}"
return self.generate(merge_prompt, parent_ids=node_ids)
def backtrack(self, node_id):
"""Return to a node's parents, removing the failed path."""
self.nodes[node_id]["score"] = -1 # mark as dead
return self.nodes[node_id]["parents"]
# Example: combining three research perspectives
got = GraphOfThought(llm, scorer)
n1 = got.generate("Research the carbon impact of AI training from an energy perspective")
n2 = got.generate("Research the carbon impact of AI training from a hardware perspective")
n3 = got.generate("Research the carbon impact of AI training from a data center perspective")
result = got.aggregate([n1, n2, n3], "Synthesize a comprehensive carbon impact report")
The Scoring Problem
Every graph operation depends on scoring. Getting scores wrong means pruning good ideas or keeping bad ones.
Scoring approaches:
- LLM-as-judge: Ask the model to rate thoughts 1-5 on relevance and quality. Consistent but expensive.
- Heuristic scoring: Length, keyword presence, format compliance. Fast but shallow.
- Reference-based: Compare to a known-good answer. Gold standard but requires ground truth.
- Ensemble score: Average multiple LLM ratings for more reliable scoring.
Graph vs Tree: Cost Comparison
For a problem with branching factor 3 and depth 3:
| Method | LLM Calls | Tokens | Best For |
|---|---|---|---|
| CoT | 1 | 100% | Linear reasoning |
| ToT | 3 × 3 = 9 (worst case) | ~900% | Branch-evaluate-prune |
| GoT | 3 + 1 = 4 (merge pattern) | ~400% | Parallel + combine |
GoT is cheaper than ToT for parallel decomposable problems because you skip the evaluation and pruning overhead on all-but-one branch.
Practical Limitations
- Scoring is hard. The quality of aggregation depends entirely on the scorer's accuracy. Bad scores corrupt the whole graph.
- Not production-ready. GoT is a research concept. Libraries like DSPy support it experimentally, but no production frameworks ship it.
- Overhead for simple tasks. If CoT works, don't add graph complexity. Only use GoT when merging intermediate results demonstrably improves output.
- Latency. Each node is an API call. A graph with 10 nodes is 10+ sequential or parallel calls. Budget accordingly.
Related Articles
Mastering Technology Creation in Midjourney: Sci-Fi Gadgets, Robots & Futuristic Devices
Create stunning futuristic technology with Midjourney using advanced prompts, material techniques, and sci-fi aesthetics. Explore robots, holographic interfaces, energy systems, and advanced gadgets.
Mastering Artifact Creation in Midjourney: Mystical Objects, Relics & Ancient Treasures
Create stunning mystical and historical artifacts with Midjourney using advanced prompts, material techniques, and magical effects. Explore ancient relics, sacred objects, enchanted items, and legendary treasures.
Fantasy & Isekai SREF Codes for Midjourney
Epic fantasy worlds with detailed environments and RPG-inspired aesthetics for Midjourney prompts.