Few-shot vs Zero-shot: Choosing the Right Strategy
Compare zero-shot, few-shot, and many-shot prompting. Learn when each works, how many examples to use, and the token cost tradeoffs across GPT-4, Claude, and Gemini.
The Example Spectrum
Every prompt exists on a spectrum: zero-shot (no examples), few-shot (some examples), many-shot (many examples). The right choice depends on task complexity, model capability, and your token budget.
Zero-Shot Prompting
The model performs a task with no examples — only instructions. This works because modern LLMs are instruction-tuned on massive datasets.
Classify the text as positive, negative, or neutral.
Text: "The delivery was late but the product works fine."
Sentiment:
When zero-shot works:
- Classification with clear labels
- Summarization and rewriting
- Simple extraction (dates, names, emails)
- Translation between common languages
- Instruction-following models (GPT-4, Claude 3.5, Gemini 1.5 Pro)
When zero-shot fails:
- Tasks requiring specific output formatting
- Domain-specific terminology the model hasn't seen
- Nuanced categorization with subtle distinctions
- Complex reasoning with multiple steps
Few-Shot Prompting
Provide 1-10 demonstrations in the prompt. The model learns the task pattern from examples — this is in-context learning.
Classify the text as positive, negative, or neutral.
Text: "Absolutely loved it, best purchase ever."
Sentiment: positive
Text: "Complete waste of money, broke after two days."
Sentiment: negative
Text: "It arrived on time, haven't tried it yet though."
Sentiment: neutral
Text: "Not what I expected but it's growing on me."
Sentiment:
Key findings from research (Min et al. 2022):
- The format matters more than label accuracy. Random labels in the right format outperform no examples.
- Input distribution matching matters — use examples from the same domain as your target.
- Label distribution matching helps too — if your real data is 70% positive, make your examples ~70% positive.
How many examples?
| Shots | Best For | Token Cost |
|---|---|---|
| 1-shot | Simple format tasks, when model already knows the domain | Minimal |
| 3-shot | Moderate classification, structured extraction | Low |
| 5-shot | Complex labeling, edge case coverage | Medium |
| 10-shot | Nuanced reasoning, multi-step tasks | High |
Many-Shot Prompting
With context windows now reaching 200K+ tokens, you can include 50-100+ examples. This was impractical a year ago but is now viable with prompt caching.
When many-shot beats few-shot:
- Highly specialized classification with many edge cases
- Low-resource languages where the model needs extensive grounding
- Complex multi-step reasoning where demonstrations build on each other
- When the model consistently fails on specific edge cases
The diminishing returns curve:
- 0 → 1 example: largest accuracy jump
- 1 → 5 examples: strong improvement
- 5 → 20 examples: moderate improvement
- 20 → 100+ examples: marginal improvement, high token cost
Making many-shot affordable:
- Prompt caching (Anthropic) gives 90% discount on cache hits. Put static examples first, dynamic input last.
- Use shorter examples — strip verbose descriptions, keep only input/output pairs.
- Use GPT-4o-mini or Claude Haiku for classification tasks where many-shot helps.
Decision Framework
Is this a classification/extraction task with clear labels?
├─ Yes → Try zero-shot first. Add 3-shot if accuracy < 90%.
└─ No → Is this complex reasoning?
├─ Yes → Use CoT or ToT instead of example-based prompting.
└─ No → Is this a format-dependent output task?
├─ Yes → Use 1-3 examples showing exact format.
└─ No → Is this a niche domain?
├─ Yes → Use 5-10 domain-specific examples.
└─ No → Try zero-shot with detailed instructions.
Provider Behavior
| Model | Zero-Shot Strength | Few-Shot Notes |
|---|---|---|
| GPT-4o | Excellent | Needs format consistency in examples |
| Claude 3.5 Sonnet | Very good | Excels with structured formatting examples |
| Gemini 1.5 Pro | Good | Prefers more examples for nuanced tasks |
| GPT-4o-mini | Moderate | Often needs 3+ examples for reliability |
| Claude Haiku | Moderate | Benefits from clear format demonstrations |
Token Cost Comparison
Typical cost for a classification task (100 input tokens, 10 output tokens):
| Strategy | Tokens In | Cost per Call (GPT-4o) | Relative |
|---|---|---|---|
| Zero-shot | 100 | $0.00025 | 1x |
| 3-shot | 400 | $0.001 | 4x |
| 10-shot | 1,100 | $0.00275 | 11x |
| 50-shot | 5,100 | $0.01275 | 51x |
With prompt caching (cached examples), the 3-shot cost drops to ~$0.0004 — nearly zero-shot pricing.
Related Articles
DeepSeek Data Extraction: High-Volume JSON Pipelines
Leverage DeepSeek's JSON mode and aggressive pricing for massive data extraction. Cache-aware batch design, retry patterns for empty JSON output, and production extraction pipeline architecture.
Claude Academic Research Assistant: Literature Review & Methodology
Leverage Claude's 200K context for academic research. Prompts for literature review synthesis, citation management, hypothesis generation, and methodology design across multiple papers.
Claude 200K Context: Strategies for Long Documents
Master Claude's 200K context window. Learn to structure prompts for massive inputs, retrieve specific information from long documents, and optimize costs when using extended context.