The Core Idea

Automatic Prompt Engineer (Zhou et al. 2022) flips the script: instead of humans writing prompts by trial and error, LLMs generate, evaluate, and select the best prompts. The key insight: prompt engineering is a search problem, and LLMs are good at both generating candidates and evaluating results.

Human prompt engineering:   Guess → Test → Guess again (hours/days)
APE:                        Generate N candidates → Score each → Pick best (minutes)

How APE Works

The APE loop has three stages:

Stage 1: Generate candidate prompts
  Input: A few input-output examples for the task
  LLM role: "Given these examples, what instruction would produce these outputs?"
  Output: N candidate instructions

Stage 2: Score each candidate
  For each candidate prompt, run it on a held-out set of examples
  Score using task metrics (accuracy, F1, exact match, etc.)
  Rank candidates by score

Stage 3: Select and optionally refine
  Pick the top-scoring prompt
  Optionally: feed top prompts back into generation for iterative refinement
  Output: final optimized prompt
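
Stages 1 and 2 draw on different examples, so the labeled data has to be split before the loop starts. A minimal sketch of that split, assuming the examples are (input, output) pairs; the helper name split_examples, the default of five demonstrations, and the toy addition data are illustrative choices, not part of the original method:

```python
import random

def split_examples(examples, num_demos=5, seed=0):
    """Split labeled (input, output) pairs into demo and held-out sets."""
    if num_demos >= len(examples):
        raise ValueError("need more examples than demonstrations")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a repeatable split
    return shuffled[:num_demos], shuffled[num_demos:]

# Toy task (hypothetical): doubling a number written as "n + n"
pairs = [(f"{n} + {n}", str(2 * n)) for n in range(20)]
demos, held_out = split_examples(pairs, num_demos=5)
print(len(demos), len(held_out))
```

Keeping the two sets disjoint means a candidate cannot score well simply by echoing the demonstrations it was generated from.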

APE In Code

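def ape(llm, task_examples, eval_examples, num_candidates=10, score_fn=None):
    """Automatic Prompt Engineer: generate and select the best prompt.

    llm is any object exposing generate(prompt, temperature=...) -> str.
    task_examples and eval_examples are lists of (input, output) pairs.
    """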

    # Stage 1: Generate candidate instructions
    demo_prompt = "Here are some input-output pairs:\n\n"
    for inp, out in task_examples:
        demo_prompt += f"Input: {inp}\nOutput: {out}\n\n"
    demo_prompt += "Generate a clear instruction that would produce"
    demo_prompt += " these outputs from these inputs."

    candidates = []
    for _ in range(num_candidates):
        instruction = llm.generate(demo_prompt, temperature=0.8)
        candidates.append(instruction)

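    # Stage 2: Score each candidate on the held-out eval set
    if not eval_examples:
        raise ValueError("eval_examples must not be empty")
    if score_fn is None:
        # Default metric: exact match after trimming whitespace
        def score_fn(result, expected):
            return result.strip() == str(expected).strip()

    scores = {}
    for i, candidate in enumerate(candidates):
        correct = 0
        for inp, expected in eval_examples:
            # Temperature 0 keeps scoring as repeatable as the model allows
            result = llm.generate(f"{candidate}\n\nInput: {inp}\nOutput:", temperature=0.0)
            if score_fn(result, expected):
                correct += 1
        scores[i] = correct / len(eval_examples)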

    # Stage 3: Select best
    best_idx = max(scores, key=scores.get)
    return {
        "best_prompt": candidates[best_idx],
        "score": scores[best_idx],
        "all_candidates": candidates,
        "all_scores": scores
    }
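
The function above stops at selection. The optional refinement step from Stage 3 could be sketched as follows, assuming the same hypothetical llm.generate(prompt, temperature=...) interface. The names refine and score_candidate, the paraphrase wording, and the StubLLM used for the dry run are all illustrative assumptions rather than the paper's implementation:

```python
def refine(llm, candidates, scores, score_candidate, top_k=3, variants_per_prompt=2):
    """Resample variations of the top-k candidates and re-score them.

    scores maps candidate index -> score (as returned by ape());
    score_candidate(prompt) -> float is supplied by the caller.
    """
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    pool = {candidates[i]: scores[i] for i in ranked[:top_k]}
    for parent in list(pool):
        for _ in range(variants_per_prompt):
            variant = llm.generate(
                "Generate a variation of the following instruction while "
                f"keeping the semantic meaning.\n\nInput: {parent}\n\nOutput:",
                temperature=0.8,
            ).strip()
            if variant and variant not in pool:
                pool[variant] = score_candidate(variant)  # only new prompts cost calls
    best = max(pool, key=pool.get)
    return best, pool[best]

class StubLLM:
    """Stand-in model for a dry run; returns numbered placeholder variants."""
    def __init__(self):
        self.calls = 0
    def generate(self, prompt, temperature=0.0):
        self.calls += 1
        return f"variant {self.calls}"

best, best_score = refine(
    StubLLM(),
    candidates=["a", "b", "c"],
    scores={0: 0.5, 1: 0.7, 2: 0.2},
    score_candidate=lambda p: 0.9 if p == "variant 2" else 0.1,
    top_k=2,
)
print(best, best_score)
```

Because the parents stay in the pool, a refinement round can never return a prompt that scored worse than the previous best.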

APE's Signature Discovery

APE's most famous result: it discovered a better zero-shot CoT prompt than the human-engineered "Let's think step by step."

APE's winner: "Let's work this out in a step by step way to be sure we have the right answer."

Prompt	MultiArith Accuracy	GSM8K Accuracy
"Let's think step by step" (human)	78.7%	40.7%
APE's winning prompt	82.0%	43.0%

The gain of 2.3 to 3.3 percentage points may seem small, but it costs almost nothing at inference time: no extra calls, just a slightly longer prompt string.
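
How large the gain looks depends on whether it is read in absolute percentage points or as a relative improvement over the baseline. A small worked calculation using only the four numbers in the table above (the function name gains is illustrative):

```python
# Accuracy figures copied from the table above: (human prompt, APE prompt)
results = {
    "MultiArith": (78.7, 82.0),
    "GSM8K": (40.7, 43.0),
}

def gains(baseline, improved):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100 * absolute / baseline
    return round(absolute, 1), round(relative, 1)

for name, (base, ape_score) in results.items():
    abs_pts, rel_pct = gains(base, ape_score)
    print(f"{name}: +{abs_pts} points absolute, +{rel_pct}% relative")
```

On these figures the absolute gains are 3.3 and 2.3 points, and the relative gains are about 4.2% and 5.7%.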

Prompt Scoring Metrics

The scoring function determines what "better" means:

Metric	What It Measures	Best For
Exact match	Output == expected	Classification, factual QA
F1 / BLEU / ROUGE	Text overlap with reference	Summarization, translation
LLM-as-judge	Ask another LLM to rate output quality	Open-ended generation, writing
Human preference	Side-by-side human rating	Subjective quality (creative, dialogue)
Cost efficiency	Accuracy per dollar of API cost	Production optimization
Latency	Response time	Real-time applications
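
The first two rows of the table can be sketched in a few lines. This is a simplified version that assumes whitespace tokenization and case-insensitive comparison; real benchmarks usually apply their own normalization rules, and the function names here are illustrative:

```python
from collections import Counter

def exact_match(prediction, reference):
    """True when the strings match after trimming and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall against the reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match(" Paris ", "paris"))
print(token_f1("the cat", "the cat sat"))
```

Either function can be passed as a scoring function, although token_f1 returns a float rather than a boolean, so the caller would need to average it or apply a threshold.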

For production systems, multi-objective scoring is common: maximize accuracy while staying under a token budget.
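
One simple way to express "maximize accuracy under a token budget" is to treat the budget as a hard filter and accuracy as the ranking key. A minimal sketch; the Candidate fields, the function name select_under_budget, and the sample numbers are all made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    prompt: str
    accuracy: float      # measured on the held-out set
    prompt_tokens: int   # length of the prompt in tokens

def select_under_budget(candidates, max_tokens):
    """Return the most accurate candidate within the budget, or None."""
    eligible = [c for c in candidates if c.prompt_tokens <= max_tokens]
    if not eligible:
        return None
    # Highest accuracy wins; ties go to the shorter prompt
    return max(eligible, key=lambda c: (c.accuracy, -c.prompt_tokens))

pool = [
    Candidate("long and detailed", accuracy=0.91, prompt_tokens=400),
    Candidate("medium", accuracy=0.88, prompt_tokens=120),
    Candidate("short", accuracy=0.88, prompt_tokens=40),
]
print(select_under_budget(pool, max_tokens=150).prompt)
```

A weighted sum of accuracy and cost is a common alternative, at the price of having to choose the weights.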

The APE Ecosystem

APE is a research concept. Zhou et al. released a demo and the code for the original paper. Several practical tools extend similar ideas:

Tool	Approach	Key Feature
DSPy	Compiler-style optimization	Automatically tunes prompts and few-shot examples; multi-step pipeline optimization
OPRO (Google DeepMind)	LLM-driven optimization	Meta-prompt that iteratively suggests improved prompts based on score history
PromptBench	Systematic evaluation framework	Benchmarks prompt robustness across models, adversarial perturbations
AutoPrompt (Shin et al. 2020)	Gradient-guided search	Uses token gradients to search for effective trigger tokens
Prefix/Prompt Tuning	Continuous soft prompts	Learns prompt embeddings via backpropagation (requires training data)
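
To make the OPRO row concrete, the idea of a meta-prompt built from score history can be sketched as below. This illustrates the general pattern only: the function name build_meta_prompt, the wording of the prompt, and the sample history are assumptions, not OPRO's actual template.

```python
def build_meta_prompt(history, task_description, max_shown=5):
    """Build an optimizer prompt from (instruction, score) history.

    Only the best max_shown entries are kept, listed from lowest to
    highest score so the strongest instruction appears last.
    """
    top = sorted(history, key=lambda pair: pair[1])[-max_shown:]
    lines = [
        task_description,
        "",
        "Previous instructions with their scores, lowest first:",
        "",
    ]
    for instruction, score in top:
        lines.append(f"text: {instruction}")
        lines.append(f"score: {score:.1f}")
        lines.append("")
    lines.append(
        "Write a new instruction that differs from the ones above "
        "and achieves a higher score."
    )
    return "\n".join(lines)

history = [
    ("Solve the problem.", 61.0),
    ("Let's think step by step.", 78.7),
    ("Answer quickly.", 40.2),
]
meta_prompt = build_meta_prompt(history, "Task: grade-school math word problems.", max_shown=2)
print(meta_prompt)
```

Each optimization step would send this text to the model, score the returned instruction, append the pair to the history, and rebuild the meta-prompt.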

DSPy is arguably the most mature of these for production use. It treats prompt engineering as a compiler problem: you declare a signature (the inputs and outputs of each step), and DSPy optimizes the instructions and few-shot examples automatically.

When APE Wins

APE beats human-written prompts when:

You have a clear evaluation metric (accuracy, F1, exact match)
You have labeled examples to score candidates against
The task is well-defined and narrow in scope
You're optimizing for a specific model (prompts that work on GPT-4 may not work on Claude)

Human-written prompts win when:

You have deep domain expertise the LLM lacks
The task requires nuanced judgment (legal, medical)
You're designing for safety/guardrails the LLM won't self-impose
You need creative/stylistic control the LLM can't evaluate objectively
You have zero labeled data (APE needs examples for scoring)

The Meta-Prompting Paradox

Who prompts the prompt engineer? APE needs:

Demonstration examples — a human selects these
Scoring function — a human defines what "better" means
Candidate generation prompt — a human designs the meta-prompt that generates candidates
Selection criteria — a human decides between accuracy, cost, latency tradeoffs

APE automates the search but not the judgment. The human role shifts from writing prompts to defining success criteria and curating evaluation data.

Recursive APE (The Rabbit Hole)

Level 0: Human writes prompts
Level 1: APE optimizes prompts
Level 2: APE optimizes the meta-prompt that optimizes prompts
Level 3: APE optimizes the scoring function that evaluates prompts
...

In practice, stop at Level 1. The meta-prompt and scoring function are better designed by humans who understand the task context.

Cost Analysis

APE is computationally expensive, but the cost is paid once per task and model:

Stage	Cost	Notes
Generate candidates	N × 1 call	N=10-50 typically, temperature > 0 for diversity
Score candidates	N × M calls	M = number of eval examples, 50-200 typical
Refinement (optional)	K × (gen + score)	K iterations, diminishing returns after 2-3

Total: roughly 500 to 10,000 LLM calls for one prompt optimization at these settings, dominated by the N × M scoring calls. Worth it if the prompt will be used thousands of times.
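
The table translates directly into a call-count estimate. A small sketch, assuming each refinement round generates and scores a fresh batch of candidates; the function and parameter names are illustrative:

```python
def ape_call_count(num_candidates, num_eval_examples,
                   refine_rounds=0, candidates_per_round=None):
    """Estimate total LLM calls for one APE run."""
    if candidates_per_round is None:
        candidates_per_round = num_candidates
    generation = num_candidates                       # N x 1 call
    scoring = num_candidates * num_eval_examples      # N x M calls
    # Each refinement round: one generation call plus M scoring calls per new candidate
    refinement = refine_rounds * candidates_per_round * (1 + num_eval_examples)
    return generation + scoring + refinement

print(ape_call_count(10, 50))                   # 10 + 500 = 510
print(ape_call_count(20, 100))                  # 20 + 2000 = 2020
print(ape_call_count(10, 50, refine_rounds=2))  # 510 + 2 * 10 * 51 = 1530
```

Scoring dominates in every case, so shrinking the eval set or pruning weak candidates early are the obvious places to save calls.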

Comparison With Other Prompt Optimization

Method	Mechanism	Requires Labeled Data?	Cost Per Optimization
APE	LLM generates + scores candidates	Yes (for scoring)	High (N×M calls)
DSPy	Compiler-style optimization with optimizers (formerly called teleprompters)	Yes (training set)	Medium (compiles once)
OPRO	Iterative meta-prompt refinement	Yes (scores each prompt on labeled examples)	Medium to high (every iteration scores new candidates)
Manual iteration	Human trial and error	No (human judgment)	Low compute, high human time
Gradient-based	Token-level gradient optimization	Yes (training data)	Medium to high (many gradient steps; needs access to model weights)

When to Skip APE

Your prompt is already performing at >95% of the theoretical ceiling
The task changes frequently (optimization cost > benefit)
You have fewer than 20 labeled examples for scoring
The prompt needs human safety review before deployment (APE can't guarantee safe outputs)
