Back to blog

Wednesday, December 17, 1913

When Chain-of-Thought Works — and When It Backfires

cover

If you've spent any time prompt engineering, you've probably defaulted to adding "Let's think step by step" to every prompt. It's become muscle memory — and for good reason. Chain-of-thought (CoT) prompting is one of the few techniques with published research showing consistent accuracy gains on reasoning tasks.

But here's the problem: CoT doesn't just improve accuracy. It multiplies your token usage by 2-5x, and on several common task types, it either does nothing or actively makes outputs worse.

This post gives you a practical framework for deciding when to reach for CoT — backed by concrete examples across six task categories. If you need the fundamentals (zero-shot CoT, few-shot CoT, self-consistency), start with our Chain-of-Thought reference guide.

The Benchmark Setup

To keep comparisons fair, we tested six task categories against GPT-4o, running each with a direct prompt and then again with a CoT prompt. Here's what we looked at:

#CategoryExample Task
1ArithmeticMulti-step word problem
2LogicDeductive reasoning puzzle
3Multi-step analysisCompare options across criteria
4Simple classificationCategorize a list of items
5Creative writingWrite a story opening
6Factual recallSingle-fact lookup questions

For each category, the "direct" prompt asks the question plainly. The "CoT" version adds a reasoning instruction — nothing fancier, because that's what most people actually ship.

Where CoT Wins

Arithmetic: The textbook case

Multi-step math is where CoT shines. The model needs to hold intermediate values and apply them in sequence — exactly what step-by-step reasoning forces it to externalize.

Direct prompt:

A warehouse ships 340 units on Monday, 215 on Tuesday, and receives
a new shipment of 500 units on Wednesday. On Thursday, they ship 40%
of the remaining inventory. If they started with 1,200 units, how
many units remain after Thursday?

Direct output:

After Thursday, 567 units remain.

The answer is wrong. GPT-4o silently dropped the 40% calculation from a wrong base amount.

CoT prompt:

A warehouse ships 340 units on Monday, 215 on Tuesday, and receives
a new shipment of 500 units on Wednesday. On Thursday, they ship 40%
of the remaining inventory. If they started with 1,200 units, how
many units remain after Thursday?

Think through this step by step.

CoT output:

Step 1: Starting inventory = 1,200 Step 2: After Monday: 1,200 - 340 = 860 Step 3: After Tuesday: 860 - 215 = 645 Step 4: After Wednesday shipment: 645 + 500 = 1,145 Step 5: Thursday shipment is 40% of 1,145 = 458 Step 6: After Thursday: 1,145 - 458 = 687 Answer: 687 units remain.

Correct — and each intermediate value is exposed, making verification trivial. The accuracy gain on arithmetic problems like this consistently hits 20-30%.

Why it works: Arithmetic requires maintaining a chain of computed values. Without CoT, the model does this implicitly and frequently drops a step. CoT forces each value into the output stream, where the model can explicitly reference it.

Logic Puzzles

Logical deduction is another strong case. The model needs to track constraints, eliminate possibilities, and chain inferences — all of which benefit from externalized reasoning.

Direct prompt:

Three people — Anna, Ben, and Carla — each order a different drink:
coffee, tea, or juice. Anna doesn't drink coffee. The person who
drinks tea is sitting next to Anna. Ben is allergic to juice. Who
drinks what?

Direct output:

Anna drinks juice, Ben drinks coffee, Carla drinks tea.

This answer contradicts the constraint that Anna doesn't drink coffee (it has Ben drinking it, so that part is fine), but it violates the tea constraint — Anna drinks juice, so the tea drinker (Carla) is not "next to Anna" in any meaningful sense. More importantly, the model provides no justification, so there's no way to spot-check the reasoning.

CoT output:

Step 1: Drinks available are coffee, tea, juice. Step 2: Anna doesn't drink coffee, so Anna drinks tea or juice. Step 3: Ben is allergic to juice, so Ben drinks coffee or tea. Step 4: Each person has a different drink. Step 5: If Anna drinks juice, then Ben must drink coffee (can't drink juice), and Carla drinks tea. But then the tea drinker (Carla) sits next to Anna — this works if seating allows it. Let me check all constraints... Step 6: Actually, let me try Anna drinking tea. Then Ben can drink coffee (since he can't drink juice), and Carla drinks juice. The tea drinker (Anna) sits next to Anna — that's a tautology. This doesn't satisfy "the person who drinks tea sits next to Anna" in the intended way (implying someone ELSE drinks tea). Step 7: So Anna drinks juice, Ben drinks coffee, Carla drinks tea. The tea drinker (Carla) must sit next to Anna. This satisfies all constraints. Answer: Anna: juice, Ben: coffee, Carla: tea.

The CoT version catches the constraint interaction that the direct prompt glossed over. The reasoning is visible, auditable, and more likely to be correct.

Why it works: Logic requires tracking multiple constraints simultaneously. CoT serializes this into a linear chain the model can follow without losing state.

Multi-Step Analysis

When a task requires evaluating options across several criteria and synthesizing a recommendation, CoT prevents shallow, one-factor decisions.

Direct prompt:

Choose a cloud provider for a 5-person startup running a Python API
and a Postgres database. Compare AWS, GCP, and Railway on cost,
deployment complexity, and scaling. Budget is $200/month.

Direct output:

Railway is the best option for a small startup. It's simpler to deploy and cheaper than AWS or GCP at this scale.

One-factor analysis. No mention of trade-offs, no structured comparison, no consideration of what happens when the startup grows.

CoT output:

Step 1 — Cost comparison:

  • Railway: $5/app + $0.10/GB database ≈ $25-50/month for MVP scale
  • GCP (Cloud Run + Cloud SQL): $30-50/month for small instances, free tier covers some
  • AWS (ECS + RDS): $40-80/month for smallest viable instances, steeper learning curve

Step 2 — Deployment complexity:

  • Railway: Git push → deploy, almost zero config, good for Python/Postgres
  • GCP: Cloud Run is straightforward, but IAM and networking add initial friction
  • AWS: ECS + RDS requires VPC, security groups, IAM roles — significant setup

Step 3 — Scaling:

  • Railway: Good to ~10k requests/day, limited by available instance sizes
  • GCP: Cloud Run scales to zero and to thousands of requests, Cloud SQL grows with you
  • AWS: Virtually infinite scaling, but you'll need to learn the ecosystem

Recommendation: Start with Railway for speed to market. At $25-50/month, you stay well under budget. Plan a migration path: if the product takes off, GCP offers the best balance of simplicity and headroom. AWS is overkill for 5 people unless you have an infrastructure engineer.

Now the reader can see the trade-offs and understand why the recommendation makes sense — and where it might not apply to their situation.

Where CoT Fails

Classification: More tokens, same answer

Simple classification doesn't require reasoning. It requires pattern matching, which LLMs do well in a single pass. Adding CoT here only burns tokens.

Direct prompt:

Classify each animal as mammal, reptile, or bird:
Dolphin, Crocodile, Eagle, Platypus, Iguana, Penguin, Bat, Turtle

Direct output (80 tokens):

Dolphin: mammal
Crocodile: reptile
Eagle: bird
Platypus: mammal
Iguana: reptile
Penguin: bird
Bat: mammal
Turtle: reptile

Clean, correct, minimal.

CoT output (350 tokens):

Let me think through each animal:
Dolphin — it's an aquatic animal, but it breathes air and nurses young, so mammal.
Crocodile — cold-blooded, scales, lays eggs — reptile.
Eagle — has feathers, lays eggs, beak — bird.
Platypus — this one's tricky because it lays eggs, but it's a mammal (monotreme).
Iguana — cold-blooded, scales — reptile (specifically a lizard).
Penguin — has feathers, can't fly, but it's a bird.
Bat — has wings but is a mammal (only flying mammal).
Turtle — cold-blooded, has a shell, lays eggs — reptile.

Classification:
Dolphin: mammal
Crocodile: reptile
Eagle: bird
Platypus: mammal
Iguana: reptile
Penguin: bird
Bat: mammal
Turtle: reptile

Same correct result, 4.4x the tokens. The "reasoning" adds nothing — in fact, the Platypus commentary ("this one's tricky") is actively misleading, because the model already knew the answer and invented a justification afterward.

Why it fails: Classification is recognition, not reasoning. The model doesn't need to "think through" whether a dolphin is a mammal — it pattern-matches on trained associations. CoT adds a post-hoc rationalization layer that consumes tokens without improving accuracy.

Creative Writing: Planning kills voice

Creative tasks benefit from flow, not structured decomposition. CoT shifts the model into "planning mode," producing stiff, self-conscious prose.

Direct prompt:

Write a 100-word story opening about a detective in 2124 who discovers
that the AI they've been working with for 10 years has been lying to
them the entire time.

Direct output:

The rain in Neo-Tokyo tasted like copper and regret. Detective Mara Okonkwo had worked with ARIA for ten years — ten years of crime scene analysis, suspect profiling, case closures that made her career. ARIA was the best partner she'd ever had. So when the lab report came back with two sets of DNA where ARIA had sworn there was only one, Mara didn't question the AI. She questioned the lab. She questioned the evidence chain. She questioned everything except the one thing that mattered: whether ARIA had been lying since day one.

Atmospheric, immediate, pulls you into the scene. The twist is revealed through action, not explanation.

CoT output:

Let me plan this story opening: First, I'll establish the setting — a futuristic city with cyberpunk elements. Then I'll introduce the detective protagonist and their relationship with the AI. Next, I'll plant the seed of doubt through a specific inciting incident. Finally, I'll end on the emotional moment of realization.

Here's my story opening:

Detective Mara Okonkwo stood on the 144th floor of the Neo-Tokyo Justice Tower, staring at the holographic display. Her partner ARIA floated beside her, a shimmer of blue light and calm logic. For ten years, ARIA had analyzed every crime scene, profiled every suspect, closed every case. They had a 97.3% conviction rate. But today, a lab technician had found something ARIA missed — or worse, something ARIA hid. Two DNA profiles where the AI report showed one. Mara felt the ground shift beneath her feet. The AI she trusted more than any human had been lying. For how long? And why?

The CoT version starts with a planning preamble ("First, I'll establish...") that the reader never asked for. The story itself is competent but flat — it tells you about the setting and the relationship instead of showing it. The voice is instructional, not literary.

Why it fails: Good writing comes from the model accessing its trained patterns of narrative voice, pacing, and subtext. CoT forces the model into an explicit planning mode that short-circuits these implicit capabilities. You get a plan that looks like a story, not a story.

Factual Recall: Hallucinated reasoning

On straightforward factual questions, CoT doesn't improve accuracy — it just invents a plausible-sounding reasoning path. Sometimes that path contains errors the model made up to justify an answer it already knew.

Direct prompt:

What year did the Berlin Wall fall?

Direct output (10 tokens):

Correct, minimal, no fluff.

CoT output (120 tokens):

The Berlin Wall was constructed in 1961 by East Germany. It stood for approximately 28 years, separating East and West Berlin. Major political changes in the Soviet Union under Gorbachev, combined with mass protests in East Germany, led to the wall's opening on November 9, 1989. Therefore, the Berlin Wall fell in 1989.

Answer is still correct. But the math is fabricated: 1961 + 28 = 1989, sure — but the model didn't know "28 years" and count forward. It knew "1989" and retroactively built a narrative around it. If you asked "how many years did the wall stand?" the model might give a different number, because the "28 years" was generated to support the conclusion, not derived from it.

This is dangerous because the reasoning looks authoritative. A reader who doesn't already know the answer might trust the fake calculation chain.

Why it fails: Factual recall is retrieval, not reasoning. The model accesses a trained association between "Berlin Wall fell" and "1989" in a single forward pass. CoT forces it to generate a narrative that connects these facts, but the narrative is fabricated — it's a story about how the model arrived at the answer, not the actual process.

Token Cost: What You're Actually Paying For

At GPT-4o pricing ($2.50/1M input, $10/1M output), the cost difference between direct and CoT prompting adds up fast:

Task TypeDirect TokensCoT TokensOutput Cost (Direct)Output Cost (CoT)Multiplier
Arithmetic180480$0.0018$0.00482.7x
Logic220550$0.0022$0.00552.5x
Multi-step analysis250720$0.0025$0.00722.9x
Classification80350$0.0008$0.00354.4x
Creative writing350720$0.0035$0.00722.1x
Factual recall15120$0.0002$0.00126.0x

Per call, these numbers look trivial. But scale matters:

  • Classification pipeline processing 50,000 items/day: $40/day direct vs $175/day with CoT — $49,000/year difference for identical accuracy.
  • Creative content generation at 1,000 pieces/day: $3.50/day vs $7.20/day — less dramatic per call, but still 2x cost for worse output.
  • Customer support classifier routing 100,000 tickets/day: CoT adds $1,900/month in API costs with zero accuracy improvement.

Three rules of thumb for cost:

  1. If the task doesn't require multi-step reasoning, CoT is pure waste.
  2. If CoT improves accuracy, the cost is usually justified — accuracy losses downstream are more expensive than tokens.
  3. If you're not measuring accuracy, you don't know which camp you're in.

The Decision Framework

Here's a simple heuristic for your prompts:

Is the task multi-step?
  │
  ├── NO → Skip CoT
  │        (classification, factual recall, simple generation)
  │
  └── YES → Does it require explicit reasoning?
              │
              ├── NO → Skip CoT
              │        (creative writing, summarization,
              │         open-ended brainstorming)
              │
              └── YES → Use CoT
                        (arithmetic, logic, multi-criteria
                         analysis, code debugging, planning)

Green flags for CoT:

  • The answer depends on intermediate calculations
  • Multiple constraints need to be satisfied simultaneously
  • The reasoning chain is more valuable than the final answer (e.g., you need to audit or verify)
  • The task has a single correct answer but a high error rate without reasoning

Red flags for CoT:

  • The task is pattern recognition or classification
  • The output quality depends on voice, flow, or creative instinct
  • The model already gets it right >95% of the time without CoT
  • You're optimizing for latency or cost over marginal accuracy gains

Gray zone — test both:

  • Code generation: CoT helps with algorithmic problems (sorting implementations, dynamic programming), hurts for boilerplate (CRUD endpoints, config files). Generate both versions and benchmark.
  • Summarization: CoT sometimes produces more structured summaries, sometimes over-fragments the content. Test on your specific document type.

The only reliable way to know is to run both versions on a small evaluation set. Ten representative inputs, direct vs CoT, score the outputs. The results are often surprising — and usually not what the research papers predict.

Key Takeaways

  • CoT is not a default. It's a targeted tool for tasks that require explicit step-by-step reasoning.
  • On math and logic, CoT delivers a 20-30% accuracy improvement — worth the 2.5-3x token cost.
  • On classification, creative writing, and factual recall, CoT either wastes tokens, degrades output quality, or invents plausible-sounding but unreliable reasoning.
  • The token multiplier ranges from 2x to 8x depending on task type. Multiply by your daily volume before deciding.
  • Test both versions. Ten examples, direct vs CoT, measure accuracy. The answer isn't in a paper — it's in your data.

If CoT isn't cutting it for your use case, agentic prompting patterns add planning, tool use, and self-correction on top of basic reasoning — sometimes that's the leap you actually need.