Few-shot vs Zero-shot: Choosing the Right Strategy

Compare zero-shot, few-shot, and many-shot prompting. Learn when each works, how many examples to use, and the token cost tradeoffs across GPT-4, Claude, and Gemini.

June 10, 2026
few-shotzero-shotmany-shotin-context-learningprompt-engineering

The Example Spectrum

Every prompt exists on a spectrum: zero-shot (no examples), few-shot (some examples), many-shot (many examples). The right choice depends on task complexity, model capability, and your token budget.

Zero-Shot Prompting

The model performs a task with no examples — only instructions. This works because modern LLMs are instruction-tuned on massive datasets.

Classify the text as positive, negative, or neutral.

Text: "The delivery was late but the product works fine."
Sentiment:

When zero-shot works:

  • Classification with clear labels
  • Summarization and rewriting
  • Simple extraction (dates, names, emails)
  • Translation between common languages
  • Instruction-following models (GPT-4, Claude 3.5, Gemini 1.5 Pro)

When zero-shot fails:

  • Tasks requiring specific output formatting
  • Domain-specific terminology the model hasn't seen
  • Nuanced categorization with subtle distinctions
  • Complex reasoning with multiple steps

Few-Shot Prompting

Provide 1-10 demonstrations in the prompt. The model learns the task pattern from examples — this is in-context learning.

Classify the text as positive, negative, or neutral.

Text: "Absolutely loved it, best purchase ever."
Sentiment: positive

Text: "Complete waste of money, broke after two days."
Sentiment: negative

Text: "It arrived on time, haven't tried it yet though."
Sentiment: neutral

Text: "Not what I expected but it's growing on me."
Sentiment:

Key findings from research (Min et al. 2022):

  • The format matters more than label accuracy. Random labels in the right format outperform no examples.
  • Input distribution matching matters — use examples from the same domain as your target.
  • Label distribution matching helps too — if your real data is 70% positive, make your examples ~70% positive.

How many examples?

ShotsBest ForToken Cost
1-shotSimple format tasks, when model already knows the domainMinimal
3-shotModerate classification, structured extractionLow
5-shotComplex labeling, edge case coverageMedium
10-shotNuanced reasoning, multi-step tasksHigh

Many-Shot Prompting

With context windows now reaching 200K+ tokens, you can include 50-100+ examples. This was impractical a year ago but is now viable with prompt caching.

When many-shot beats few-shot:

  • Highly specialized classification with many edge cases
  • Low-resource languages where the model needs extensive grounding
  • Complex multi-step reasoning where demonstrations build on each other
  • When the model consistently fails on specific edge cases

The diminishing returns curve:

  • 0 → 1 example: largest accuracy jump
  • 1 → 5 examples: strong improvement
  • 5 → 20 examples: moderate improvement
  • 20 → 100+ examples: marginal improvement, high token cost

Making many-shot affordable:

  • Prompt caching (Anthropic) gives 90% discount on cache hits. Put static examples first, dynamic input last.
  • Use shorter examples — strip verbose descriptions, keep only input/output pairs.
  • Use GPT-4o-mini or Claude Haiku for classification tasks where many-shot helps.

Decision Framework

Is this a classification/extraction task with clear labels?
  ├─ Yes → Try zero-shot first. Add 3-shot if accuracy < 90%.
  └─ No → Is this complex reasoning?
           ├─ Yes → Use CoT or ToT instead of example-based prompting.
           └─ No → Is this a format-dependent output task?
                    ├─ Yes → Use 1-3 examples showing exact format.
                    └─ No → Is this a niche domain?
                             ├─ Yes → Use 5-10 domain-specific examples.
                             └─ No → Try zero-shot with detailed instructions.

Provider Behavior

ModelZero-Shot StrengthFew-Shot Notes
GPT-4oExcellentNeeds format consistency in examples
Claude 3.5 SonnetVery goodExcels with structured formatting examples
Gemini 1.5 ProGoodPrefers more examples for nuanced tasks
GPT-4o-miniModerateOften needs 3+ examples for reliability
Claude HaikuModerateBenefits from clear format demonstrations

Token Cost Comparison

Typical cost for a classification task (100 input tokens, 10 output tokens):

StrategyTokens InCost per Call (GPT-4o)Relative
Zero-shot100$0.000251x
3-shot400$0.0014x
10-shot1,100$0.0027511x
50-shot5,100$0.0127551x

With prompt caching (cached examples), the 3-shot cost drops to ~$0.0004 — nearly zero-shot pricing.