Automatic Prompt Engineering (APE)
Use LLMs to generate, score, and optimize prompts for other LLMs. APE found a zero-shot CoT prompt that outperformed the human-written one, and the same principles apply to your production prompts.
The Core Idea
Automatic Prompt Engineer (Zhou et al. 2022) flips the script: instead of humans writing prompts by trial and error, LLMs generate, evaluate, and select the best prompts. The key insight: prompt engineering is a search problem, and LLMs are good at both generating candidates and evaluating results.
Human prompt engineering: Guess → Test → Guess again (hours/days)
APE: Generate N candidates → Score each → Pick best (minutes)
How APE Works
The APE loop has three stages:
Stage 1: Generate candidate prompts
Input: A few input-output examples for the task
LLM role: "Given these examples, what instruction would produce these outputs?"
Output: N candidate instructions
Stage 2: Score each candidate
For each candidate prompt, run it on a held-out set of examples
Score using task metrics (accuracy, F1, exact match, etc.)
Rank candidates by score
Stage 3: Select and optionally refine
Pick the top-scoring prompt
Optionally: feed top prompts back into generation for iterative refinement
Output: final optimized prompt
APE In Code
```python
def default_accuracy(result, expected):
    """Exact match after trimming whitespace and ignoring case."""
    return result.strip().lower() == str(expected).strip().lower()


def ape(llm, task_examples, eval_examples, num_candidates=10, score_fn=None):
    """Automatic Prompt Engineer: generate and select the best prompt.

    `llm` is any object with a generate(prompt, temperature=...) method
    that returns a string; temperature is optional.
    """
    if not eval_examples:
        raise ValueError("eval_examples must not be empty")
    if score_fn is None:
        score_fn = default_accuracy  # swap in a task-specific metric if needed

    # Stage 1: Generate candidate instructions
    demo_prompt = "Here are some input-output pairs:\n\n"
    for inp, out in task_examples:
        demo_prompt += f"Input: {inp}\nOutput: {out}\n\n"
    demo_prompt += "Generate a clear instruction that would produce"
    demo_prompt += " these outputs from these inputs."

    # Sample at temperature > 0 so the candidates differ from each other
    candidates = []
    for _ in range(num_candidates):
        instruction = llm.generate(demo_prompt, temperature=0.8)
        candidates.append(instruction)

    # Stage 2: Score each candidate on the held-out eval set
    scores = {}
    for i, candidate in enumerate(candidates):
        correct = 0
        for inp, expected in eval_examples:
            result = llm.generate(f"{candidate}\n\nInput: {inp}")
            if score_fn(result, expected):
                correct += 1
        scores[i] = correct / len(eval_examples)

    # Stage 3: Select best
    best_idx = max(scores, key=scores.get)
    return {
        "best_prompt": candidates[best_idx],
        "score": scores[best_idx],
        "all_candidates": candidates,
        "all_scores": scores,
    }
```
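The optional refinement step from Stage 3 is not covered by the function above. A minimal sketch of one refinement round follows, assuming the same hypothetical `llm.generate(prompt, temperature=...)` interface. The `refine` function, its parameters, the wording of the variation request, and the `StubLLM` class are all illustrative assumptions rather than anything taken from the APE code release.

```python
def refine(llm, scored_prompts, eval_fn, top_k=3, variants_per_prompt=2):
    """One refinement round: paraphrase the top prompts, score them, return the best.

    scored_prompts: list of (prompt, score) pairs from a previous round.
    eval_fn: callable mapping a prompt string to a score (higher is better).
    """
    top = sorted(scored_prompts, key=lambda pair: pair[1], reverse=True)[:top_k]
    pool = dict(top)  # keep the parents so a round can never lower the best score
    for prompt, _ in top:
        for _ in range(variants_per_prompt):
            request = (
                "Generate a variation of the following instruction "
                "while keeping its meaning.\n\n"
                f"Instruction: {prompt}\n\nVariation:"
            )
            variant = llm.generate(request, temperature=0.8).strip()
            if variant and variant not in pool:
                pool[variant] = eval_fn(variant)  # only new variants cost eval calls
    best = max(pool, key=pool.get)
    return best, pool[best]


class StubLLM:
    """Stand-in for a real client: appends a fixed phrase to the instruction."""

    def generate(self, prompt, temperature=0.0):
        instruction = prompt.split("Instruction: ")[1].split("\n\nVariation:")[0]
        return instruction + " Think carefully."


def toy_eval(prompt):
    """Toy scorer for the demo; a real one would run the prompt on eval examples."""
    return 0.9 if "carefully" in prompt else 0.5


scored = [("Add the numbers.", 0.5), ("Sum the inputs.", 0.4)]
best_prompt, best_score = refine(StubLLM(), scored, toy_eval)
```

Keeping the parent prompts in the pool means each round either improves the best score or leaves it unchanged, which makes it easy to stop once a round yields no gain.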
APE's Signature Discovery
APE's most famous result: it discovered a better zero-shot CoT prompt than the human-engineered "Let's think step by step."
APE's winner: "Let's work this out in a step by step way to be sure we have the right answer."
| Prompt | MultiArith Accuracy | GSM8K Accuracy |
|---|---|---|
| "Let's think step by step" (human) | 78.7% | 40.7% |
| APE's winning prompt | 82.0% | 43.0% |
The gain of roughly 2 to 3 percentage points (3.3 on MultiArith, 2.3 on GSM8K) may seem small, but it is nearly free: no extra calls and no extra examples, only a slightly longer prompt string.
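To make the size of the gain concrete, the short calculation below works out both the absolute difference (in percentage points) and the relative improvement from the table's numbers. The `gains` helper is an illustrative name, not part of any library.

```python
# Accuracies (%) copied from the table above
results = {
    "MultiArith": {"human": 78.7, "ape": 82.0},
    "GSM8K": {"human": 40.7, "ape": 43.0},
}


def gains(human, ape):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = ape - human
    relative = 100 * absolute / human
    return round(absolute, 1), round(relative, 1)


for name, acc in results.items():
    points, percent = gains(acc["human"], acc["ape"])
    print(f"{name}: +{points} points ({percent}% relative)")
# MultiArith: +3.3 points (4.2% relative)
# GSM8K: +2.3 points (5.7% relative)
```

Stating which of the two figures is meant avoids ambiguity when comparing prompt variants.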
Prompt Scoring Metrics
The scoring function determines what "better" means:
| Metric | What It Measures | Best For |
|---|---|---|
| Exact match | Output == expected | Classification, factual QA |
| F1 / BLEU / ROUGE | Text overlap with reference | Summarization, translation |
| LLM-as-judge | Ask another LLM to rate output quality | Open-ended generation, writing |
| Human preference | Side-by-side human rating | Subjective quality (creative, dialogue) |
| Cost efficiency | Accuracy per dollar of API cost | Production optimization |
| Latency | Response time | Real-time applications |
For production systems, multi-objective scoring is common: maximize accuracy while staying under a token budget.
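As a rough illustration of the first two rows and of budget-constrained scoring, here is a sketch using only the standard library. The function names and the hard-cutoff budget rule are assumptions chosen for clarity. A production setup might prefer a soft penalty or an established metrics package.

```python
from collections import Counter


def exact_match(output, expected):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def token_f1(output, expected):
    """Token-level F1 between whitespace-split output and reference."""
    out_tokens = output.lower().split()
    exp_tokens = expected.lower().split()
    if not out_tokens or not exp_tokens:
        return float(out_tokens == exp_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def budgeted_score(accuracy, prompt_tokens, token_budget):
    """Multi-objective rule: accuracy counts only if the prompt fits the budget."""
    return accuracy if prompt_tokens <= token_budget else 0.0
```

With this rule a 90%-accurate prompt that exceeds the budget ranks below an 80%-accurate prompt that fits, which is the behavior you want when the token limit is a hard constraint.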
The APE Ecosystem
APE is a research concept. Zhou et al. released a demo and the code for the original paper. Several practical tools extend similar ideas:
| Tool | Approach | Key Feature |
|---|---|---|
| DSPy | Compiler-style optimization | Automatically tunes prompts and few-shot examples; multi-step pipeline optimization |
| OPRO (Google DeepMind) | LLM-driven optimization | Meta-prompt that iteratively suggests improved prompts based on score history |
| PromptBench | Systematic evaluation framework | Benchmarks prompt robustness across models, adversarial perturbations |
| AutoPrompt (Shin et al. 2020) | Gradient-guided search | Uses token gradients to find optimal trigger words |
| Prefix/Prompt Tuning | Continuous soft prompts | Learns prompt embeddings via backpropagation (requires training data) |
DSPy is the most mature of these for production use. It treats prompt engineering as a compiler problem: you declare a signature (the inputs and outputs of each step), and DSPy optimizes the prompt and few-shot examples automatically.
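The OPRO row in the table describes a meta-prompt built from score history. A minimal sketch of how such a meta-prompt might be assembled is shown below. The function name, the prompt wording, and the choice to list instructions from lowest to highest score are assumptions modeled on that description, not OPRO's actual template.

```python
def build_meta_prompt(task_description, history, max_shown=5):
    """Assemble an OPRO-style meta-prompt from (instruction, score) pairs."""
    # Keep the best few and list them worst first, so the strongest is read last
    shown = sorted(history, key=lambda pair: pair[1])[-max_shown:]
    lines = [
        f"Task: {task_description}",
        "",
        "Previous instructions and their scores:",
        "",
    ]
    for instruction, score in shown:
        lines.append(f"text: {instruction}")
        lines.append(f"score: {score:.1f}")
        lines.append("")
    lines.append(
        "Write a new instruction that differs from those above "
        "and is likely to score higher."
    )
    return "\n".join(lines)


history = [
    ("Solve the problem.", 70.0),
    ("Let's think step by step.", 85.0),
    ("Answer quickly.", 60.0),
]
meta_prompt = build_meta_prompt("grade-school math word problems", history, max_shown=2)
```

Each optimization step would send this text to the optimizer LLM, score the returned instruction, append the pair to `history`, and rebuild the meta-prompt.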
When APE Wins
APE tends to beat human-written prompts when:
- You have a clear evaluation metric (accuracy, F1, exact match)
- You have labeled examples to score candidates against
- The task is well-defined and narrow in scope
- You're optimizing for a specific model (prompts that work on GPT-4 may not work on Claude)
Human-written prompts win when:
- You have deep domain expertise the LLM lacks
- The task requires nuanced judgment (legal, medical)
- You're designing for safety/guardrails the LLM won't self-impose
- You need creative/stylistic control the LLM can't evaluate objectively
- You have no labeled data (APE needs examples for scoring)
The Meta-Prompting Paradox
Who prompts the prompt engineer? APE needs:
- Demonstration examples — a human selects these
- Scoring function — a human defines what "better" means
- Candidate generation prompt — a human designs the meta-prompt that generates candidates
- Selection criteria — a human decides between accuracy, cost, latency tradeoffs
APE automates the search but not the judgment. The human role shifts from writing prompts to defining success criteria and curating evaluation data.
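One way to keep those human decisions visible is to collect them in a single configuration object that is checked before any search runs. The sketch below is hypothetical: `APESetup`, its fields, and the `{demos}` placeholder convention are illustrative names, not part of any APE tooling.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected output)


@dataclass
class APESetup:
    """The human-supplied inputs an APE run depends on."""

    demos: List[Example]                  # demonstration examples
    eval_set: List[Example]               # held-out examples for scoring
    score_fn: Callable[[str, str], bool]  # what "better" means
    meta_prompt: str                      # template that generates candidates
    max_prompt_tokens: int = 200          # selection criterion: cost ceiling

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the setup is usable."""
        problems = []
        if not self.demos:
            problems.append("no demonstration examples")
        if not self.eval_set:
            problems.append("no evaluation examples")
        if set(self.demos) & set(self.eval_set):
            problems.append("eval_set overlaps with demos")
        if "{demos}" not in self.meta_prompt:
            problems.append("meta_prompt has no {demos} placeholder")
        return problems


setup = APESetup(
    demos=[("2+2", "4")],
    eval_set=[("2+2", "4")],
    score_fn=lambda output, expected: output.strip() == expected,
    meta_prompt="{demos}\nWrite the instruction that produces these outputs.",
)
problems = setup.validate()
```

Here the check reports that the evaluation set reuses a demonstration example, which would inflate the scores of candidates that simply memorize the demonstrations.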
Recursive APE (The Rabbit Hole)
Level 0: Human writes prompts
Level 1: APE optimizes prompts
Level 2: APE optimizes the meta-prompt that optimizes prompts
Level 3: APE optimizes the scoring function that evaluates prompts
...
In practice, stop at Level 1. The meta-prompt and scoring function are better designed by humans who understand the task context.
Cost Analysis
APE is computationally expensive, but the cost is paid once per prompt:
| Stage | Cost | Notes |
|---|---|---|
| Generate candidates | N × 1 call | N=10-50 typically, temperature > 0 for diversity |
| Score candidates | N × M calls | M = number of eval examples, 50-200 typical |
| Refinement (optional) | K × (gen + score) | K iterations, diminishing returns after 2-3 |
Total: roughly 500 to 10,000 scoring calls for a single pass (N × M with the ranges above), plus N generation calls and any refinement rounds. Worth it if the prompt will be used thousands of times.
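The arithmetic from the table can be checked with a small helper. This sketch assumes that every refinement round regenerates and rescores the same number of candidates as the first pass, which is a simplification. `ape_call_count` is an illustrative name.

```python
def ape_call_count(num_candidates, num_eval, refinement_rounds=0):
    """Estimate total LLM calls: N generation calls plus N x M scoring calls per round."""
    per_round = num_candidates + num_candidates * num_eval
    return per_round * (1 + refinement_rounds)


print(ape_call_count(10, 50))      # 510: low end of the table's ranges
print(ape_call_count(50, 200))     # 10050: high end of the table's ranges
print(ape_call_count(10, 50, 2))   # 1530: low end with two refinement rounds
```

Multiplying the result by your per-call price gives a quick budget estimate before starting a run.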
Comparison With Other Prompt Optimization
| Method | Mechanism | Requires Labeled Data? | Cost Per Optimization |
|---|---|---|---|
| APE | LLM generates + scores candidates | Yes (for scoring) | High (N×M calls) |
| DSPy | Compiler optimization with optimizers (formerly called teleprompters) | Yes (training set) | Medium (compiles once) |
| OPRO | Iterative meta-prompt refinement | Yes (candidates are scored on labeled examples) | Medium |
| Manual iteration | Human trial and error | No (human judgment) | Low compute, high human time |
| Gradient-based | Token-level gradient optimization | Yes (training data) | Moderate (many forward/backward passes; needs access to model weights) |
When to Skip APE
- Your prompt is already performing at >95% of the theoretical ceiling
- The task changes frequently (optimization cost > benefit)
- You have fewer than 20 labeled examples for scoring
- The prompt needs human safety review before deployment (APE can't guarantee safe outputs)