Self-Consistency: Improving Reasoning Through Majority Voting

Sample multiple reasoning paths and vote on the most common answer. This wrapper around Chain-of-Thought prompting boosts accuracy on arithmetic, commonsense, and symbolic tasks.

June 10, 2026
self-consistency, reasoning, chain-of-thought, voting, prompt-engineering

The Core Idea

Self-consistency (Wang et al. 2022) replaces greedy decoding with diverse sampling. Instead of taking one reasoning path, generate multiple paths and select the most consistent answer. Errors tend to be unique; correct answers converge.

Standard CoT:  Model → One reasoning path → One answer (greedy, risk of error)
Self-Consistency: Model → 5-10 reasoning paths → Majority vote → Most reliable answer

When Single-Path Reasoning Fails

Question: When I was 6 my sister was half my age. Now I'm 70,
          how old is my sister?

Single CoT output:
"When I was 6, my sister was half my age = 3.
Now I'm 70, so she's 70 / 2 = 35."
→ WRONG (correct answer: 67)

With self-consistency, you generate multiple paths:

Path 1: "Sister was 3 when I was 6, age difference is 3 years.
        At 70, sister is 70 - 3 = 67." → Answer: 67 ✓

Path 2: "Sister was half my age at 6 = 3 years old.
        70 - 6 = 64 years passed. She's 3 + 64 = 67." → Answer: 67 ✓

Path 3: "Half of 70 is 35." → Answer: 35 ✗

Result: 67 appears twice, 35 appears once → Final answer: 67 ✓

Implementation

import asyncio
import re
from collections import Counter

async def self_consistency(model, prompt, n_samples=5, temperature=0.7):
    """Generate N reasoning paths and return the majority answer."""
    # Sample all paths concurrently; temperature > 0 is what makes them differ
    responses = await asyncio.gather(*[
        model.generate(prompt, temperature=temperature, max_tokens=500)
        for _ in range(n_samples)
    ])

    # Extract final answers from reasoning paths
    answers = [extract_final_answer(r) for r in responses]

    # Majority vote; on a tie, Counter returns the answer seen first
    counts = Counter(answers)
    best_answer, votes = counts.most_common(1)[0]
    # Share of paths that agree, not a calibrated probability
    confidence = votes / n_samples

    return {
        "answer": best_answer,
        "confidence": confidence,
        "all_answers": answers,
        "reasoning_paths": responses
    }

def extract_final_answer(text: str) -> str:
    """Extract the final answer from a reasoning chain.
    Looks for patterns like 'The answer is X' or 'Therefore, X'."""
    # Lowercase once so every return path yields comparable strings
    text = text.strip().lower()
    patterns = [
        r'(?:answer is|therefore,?|conclusion:)\s*(.+)',
        r'(?:^|\n)(\d+)\s*$'  # Last line is just a number
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            # Drop trailing punctuation so "67." and "67" count as the same vote
            return matches[-1].strip().rstrip('.!')
    # Fallback: treat the last line of the response as the answer
    return text.split('\n')[-1].strip().rstrip('.!')
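
Exact-match voting counts "67", "67.0", and "67 years old" as three different answers, which can split the vote. Below is a minimal sketch of a normalization step you could apply to extracted answers before counting. It assumes the task has numeric answers; normalize_answer is a hypothetical helper, not part of the implementation above.

```python
import re
from collections import Counter

def normalize_answer(raw: str) -> str:
    """Hypothetical helper: reduce a raw answer string to a canonical form.

    Assumes numeric answers. Text with no number in it is lowercased and
    stripped of surrounding whitespace and trailing punctuation.
    """
    cleaned = raw.strip().lower().replace(",", "")  # "1,250" -> "1250"
    numbers = re.findall(r"-?\d+(?:\.\d+)?", cleaned)
    if numbers:
        value = float(numbers[-1])  # use the last number mentioned
        # Render whole numbers without a decimal point so "67.0" equals "67"
        return str(int(value)) if value.is_integer() else str(value)
    return cleaned.rstrip(".!?")

raw_answers = ["67", "67.0", "67 years old", "35"]
votes = Counter(normalize_answer(a) for a in raw_answers)
best_answer, count = votes.most_common(1)[0]
```

Taking the last number in the string is a design choice that suits answers like "she is 67", but it would pick the wrong value for something like "67 out of 100". Adjust the rule to the answer format your prompt asks for.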

Aggregation Strategies

Method | How It Works | Best For
Majority vote | Count exact answer matches | Discrete answers (numbers, categories)
Weighted vote | Weight by reasoning chain quality score | When you have a confidence evaluator
Span extraction | Find overlapping answer spans across responses | Free-text answers
LLM aggregator | Ask another LLM call to synthesize all paths | Complex multi-faceted answers
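
As an illustration of the weighted vote row, here is a minimal sketch that sums a per-path score for each answer instead of counting one vote per path. The scores are assumed to come from some evaluator you already have (a verifier model, log-probabilities, a heuristic); weighted_vote and the example scores are hypothetical.

```python
from collections import defaultdict

def weighted_vote(scored_answers):
    """Return (answer, share), where share is the winner's fraction of total weight.

    scored_answers: list of (answer, weight) pairs with weight >= 0.
    """
    totals = defaultdict(float)
    for answer, weight in scored_answers:
        totals[answer] += weight
    total_weight = sum(totals.values())
    if total_weight == 0:
        raise ValueError("no answers, or all weights are zero")
    best = max(totals, key=totals.get)
    return best, totals[best] / total_weight

# Made-up quality scores for three reasoning paths
paths = [("67", 0.9), ("67", 0.8), ("35", 0.4)]
answer, share = weighted_vote(paths)
```

With equal weights this reduces to plain majority voting. With unequal weights, a single high-scoring path can outvote several low-scoring ones, so the result is only as good as the scorer.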

Temperature and Sampling

Temperature controls diversity. Higher = more diverse paths, but also more noise.

Temperature | Diversity | Accuracy Impact | Best For
0.0 | Deterministic | No gain (same path each time) | Never use for self-consistency
0.3-0.5 | Low diversity | Small gains | Simple arithmetic
0.5-0.7 | Moderate diversity | Best balance | Most reasoning tasks
0.7-1.0 | High diversity | Risk of noise overwhelming signal | Complex open-ended reasoning
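
To see why temperature changes diversity, it helps to look at what it does to the next-token distribution. The sketch below assumes standard temperature scaling (logits divided by the temperature before softmax); the logits are made up and softmax_with_temperature is a hypothetical helper, not any provider's API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 means greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature nearly all the probability mass sits on the top token, so repeated samples follow almost the same path. Raising the temperature flattens the distribution, which gives lower-ranked tokens a real chance and produces the varied paths that voting needs.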

When Self-Consistency Helps

Strong gains on:

  • Arithmetic reasoning (GSM8K, MATH datasets)
  • Commonsense reasoning (StrategyQA, CommonsenseQA)
  • Symbolic reasoning (date arithmetic, logical deduction)

Weak or no gains on:

  • Factual recall (the model either knows it or doesn't)
  • Simple classification (paths all converge to same answer)
  • Tasks where the model makes the same mistake on nearly every path (voting only confirms the shared error)
  • Creative writing (no single "correct" answer)
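
A simplified model shows why voting helps only when individual paths are right more often than wrong. The sketch below assumes just two possible answers, independent samples, and a fixed per-sample accuracy p, none of which hold exactly for a real model; majority_accuracy is a hypothetical helper for illustration.

```python
from math import comb

def majority_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent samples are correct.

    Assumes two possible answers, each sample correct with probability p,
    and an odd n so there are no ties.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    need = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

for p in (0.4, 0.6, 0.8):
    print(p, [round(majority_accuracy(p, n), 3) for n in (1, 5, 9)])
```

Under these assumptions, voting amplifies whatever tendency a single path has: above p = 0.5 more samples push accuracy up, below it they push accuracy down. Real tasks are often kinder than this two-answer case, because wrong paths tend to scatter across different wrong answers while correct paths agree.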

Cost Analysis

Self-consistency multiplies token costs linearly. Every sample is a full API call.

Samples | Relative Cost | Typical Accuracy Gain
1 (baseline) | 1x | -
3 | 3x | +10-15%
5 | 5x | +15-20%
10 | 10x | +20-25% (diminishing returns beyond 10)
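
To turn the relative cost column into an actual budget, a rough estimate is enough. This sketch assumes every sample re-sends the full prompt and is billed per 1,000 tokens; estimate_cost, the token counts, and the prices are all hypothetical placeholders, so substitute your provider's real numbers.

```python
def estimate_cost(n_samples, prompt_tokens, completion_tokens,
                  price_in_per_1k, price_out_per_1k):
    """Estimate total spend for one self-consistency query.

    Assumes each sample is a separate call that pays for the full prompt
    plus its own completion. Prices are per 1,000 tokens.
    """
    per_call = ((prompt_tokens / 1000) * price_in_per_1k
                + (completion_tokens / 1000) * price_out_per_1k)
    return n_samples * per_call

# Placeholder numbers, not real pricing
single = estimate_cost(1, prompt_tokens=400, completion_tokens=300,
                       price_in_per_1k=0.5, price_out_per_1k=1.5)
voted = estimate_cost(5, prompt_tokens=400, completion_tokens=300,
                      price_in_per_1k=0.5, price_out_per_1k=1.5)
print(f"1 sample: {single:.4f}  5 samples: {voted:.4f}")
```

Some APIs can return several completions for one request or discount repeated prompts, in which case the input side grows more slowly than this estimate suggests.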

When the cost is worth it:

  • High-stakes decisions where accuracy matters more than cost
  • Automated pipelines where you can batch process
  • One-time analysis tasks (research, legal review)

Combining With Other Techniques

Self-consistency wraps around other prompting strategies; it does not replace them.

  • CoT + Self-Consistency: The standard combination. Generate CoT chains, vote on answers.
  • ToT + Self-Consistency: Generate multiple trees, vote on the final answers at their leaf nodes.
  • Few-Shot + Self-Consistency: Use few-shot examples to improve individual path quality, then vote.
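
For the few-shot combination, the prompt itself can make voting easier by asking every path to end in the same answer format. Below is a minimal sketch of such a prompt builder; build_cot_prompt, the instruction wording, and the example are hypothetical, and the fixed "The answer is X" line is chosen only because it matches the "answer is" pattern used for extraction earlier.

```python
def build_cot_prompt(question, examples=()):
    """Build a few-shot Chain-of-Thought prompt with a fixed answer format.

    examples: iterable of (question, reasoning, answer) tuples.
    """
    parts = ["Solve the problem step by step. "
             "End with a line of the form 'The answer is X'."]
    for q, reasoning, answer in examples:
        parts.append(f"Q: {q}\nA: {reasoning}\nThe answer is {answer}")
    # Leave the final answer slot open for the model to complete
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

examples = [(
    "When I was 6 my sister was half my age. Now I'm 70, how old is my sister?",
    "She was 3 when I was 6, so she is 3 years younger. 70 - 3 = 67.",
    "67",
)]
prompt = build_cot_prompt("When I was 10 my brother was 4. Now I'm 30, how old is he?",
                          examples)
```

The same prompt is then sampled several times at a non-zero temperature, and the extracted answers are voted on as usual.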