Automatic Prompt Engineering (APE)

Use LLMs to generate, score, and optimize prompts for other LLMs. APE discovered a better CoT prompt than humans did — and the same principles apply to your production prompts.

June 11, 2026
Tags: ape, automatic-prompt-engineering, prompt-optimization, dspy, prompt-engineering, evaluation

The Core Idea

Automatic Prompt Engineer (Zhou et al. 2022) flips the script: instead of humans writing prompts by trial and error, LLMs generate, evaluate, and select the best prompts. The key insight: prompt engineering is a search problem, and LLMs are good at both generating candidates and evaluating results.

Human prompt engineering:   Guess → Test → Guess again (hours/days)
APE:                        Generate N candidates → Score each → Pick best (minutes)

How APE Works

The APE loop has three stages:

Stage 1: Generate candidate prompts
  Input: A few input-output examples for the task
  LLM role: "Given these examples, what instruction would produce these outputs?"
  Output: N candidate instructions

Stage 2: Score each candidate
  For each candidate prompt, run it on a held-out set of examples
  Score using task metrics (accuracy, F1, exact match, etc.)
  Rank candidates by score

Stage 3: Select and optionally refine
  Pick the top-scoring prompt
  Optionally: feed top prompts back into generation for iterative refinement
  Output: final optimized prompt
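
Stage 2 only gives an honest ranking if the scoring examples are kept apart from the ones shown to the generator in Stage 1. A minimal sketch of such a split follows; split_examples is a hypothetical helper name, and the toy arithmetic data exists only to make the snippet runnable.

```python
import random


def split_examples(examples, num_demos, seed=0):
    """Split labeled (input, output) pairs into demos and a held-out eval set.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not 0 < num_demos < len(examples):
        raise ValueError("num_demos must leave at least one example on each side")
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    return shuffled[:num_demos], shuffled[num_demos:]


# Toy data: 25 addition problems as (input, expected output) pairs.
examples = [(f"{a} + {b}", str(a + b)) for a in range(5) for b in range(5)]
demos, held_out = split_examples(examples, num_demos=5)
print(len(demos), len(held_out))  # 5 20
```

The demos go into the generation prompt; the held-out pairs are used only for scoring.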

APE In Code

def default_accuracy(result, expected):
    """Exact match after trimming whitespace; swap in a task-specific metric."""
    return result.strip() == str(expected).strip()


def ape(llm, task_examples, eval_examples, num_candidates=10, score_fn=None):
    """Automatic Prompt Engineer: generate and select the best prompt.

    `llm` is any client object with a generate(prompt, temperature=...) -> str
    method. `task_examples` and `eval_examples` are lists of (input, output) pairs.
    """
    if not eval_examples:
        raise ValueError("eval_examples must not be empty")

    # Stage 1: Generate candidate instructions
    demo_prompt = "Here are some input-output pairs:\n\n"
    for inp, out in task_examples:
        demo_prompt += f"Input: {inp}\nOutput: {out}\n\n"
    demo_prompt += "Generate a clear instruction that would produce"
    demo_prompt += " these outputs from these inputs."

    # Sample at temperature > 0 so the candidates differ from each other
    candidates = [
        llm.generate(demo_prompt, temperature=0.8).strip()
        for _ in range(num_candidates)
    ]

    # Stage 2: Score each candidate on the held-out eval set
    if score_fn is None:
        score_fn = default_accuracy  # exact match or task-specific metric

    scores = {}
    for i, candidate in enumerate(candidates):
        correct = 0
        for inp, expected in eval_examples:
            # Temperature 0 keeps scoring as repeatable as the model allows
            result = llm.generate(f"{candidate}\n\nInput: {inp}", temperature=0.0)
            if score_fn(result, expected):
                correct += 1
        scores[i] = correct / len(eval_examples)

    # Stage 3: Select best
    best_idx = max(scores, key=scores.get)
    return {
        "best_prompt": candidates[best_idx],
        "score": scores[best_idx],
        "all_candidates": candidates,
        "all_scores": scores,
    }
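
The function above stops at selection. The optional refinement step from Stage 3 could be sketched as follows, loosely modeled on the idea of resampling variations of the top candidates. This is an illustration under assumptions, not the paper's implementation: refine, score_candidate, the parameter names, and the wording of the variation prompt are all hypothetical, and llm is assumed to expose the same generate(prompt, temperature=...) method as above.

```python
def refine(llm, scored, score_candidate, rounds=2, keep_top=3, variants_per_parent=2):
    """Iteratively resample around the best candidates.

    scored: list of (prompt, score) pairs from an initial APE pass.
    score_candidate: callable(prompt) -> float, e.g. accuracy on the eval set.
    llm: hypothetical client exposing generate(prompt, temperature=...) -> str.
    """
    pool = dict(scored)  # prompt -> score; also dedupes repeated prompts
    for _round in range(rounds):
        # Take the current leaders as parents for this round
        parents = sorted(pool, key=pool.get, reverse=True)[:keep_top]
        for parent in parents:
            for _variant in range(variants_per_parent):
                variant = llm.generate(
                    "Generate a variation of the following instruction "
                    f"while keeping its meaning.\n\nInstruction: {parent}",
                    temperature=0.8,
                ).strip()
                # Only pay for scoring when the variant is new
                if variant and variant not in pool:
                    pool[variant] = score_candidate(variant)
    best = max(pool, key=pool.get)
    return best, pool[best]
```

Because the original candidates stay in the pool, the returned score can never be lower than the best score from the first pass.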

APE's Signature Discovery

APE's most famous result: it discovered a better zero-shot CoT prompt than the human-engineered "Let's think step by step."

APE's winner: "Let's work this out in a step by step way to be sure we have the right answer."

| Prompt | MultiArith Accuracy | GSM8K Accuracy |
|---|---|---|
| "Let's think step by step" (human) | 78.7% | 40.7% |
| APE's winning prompt | 82.0% | 43.0% |

The gain of 2.3 to 3.3 percentage points may seem small, but it is nearly free: no extra calls and no extra pipeline stages, just a slightly longer prompt string.

Prompt Scoring Metrics

The scoring function determines what "better" means:

| Metric | What It Measures | Best For |
|---|---|---|
| Exact match | Output == expected | Classification, factual QA |
| F1 / BLEU / ROUGE | Text overlap with reference | Summarization, translation |
| LLM-as-judge | Ask another LLM to rate output quality | Open-ended generation, writing |
| Human preference | Side-by-side human rating | Subjective quality (creative, dialogue) |
| Cost efficiency | Accuracy per dollar of API cost | Production optimization |
| Latency | Response time | Real-time applications |

For production systems, multi-objective scoring is common: maximize accuracy while staying under a token budget.
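
As a rough illustration, here are two of the metrics above plus one possible multi-objective score. The function names, whitespace tokenization, and the linear per-token penalty are assumptions chosen for brevity; a real system would pick a tokenizer and penalty that fit its own budget.

```python
from collections import Counter


def exact_match(output, expected):
    """1.0 if the strings are equal after trimming whitespace, else 0.0."""
    return float(output.strip() == expected.strip())


def token_f1(output, expected):
    """Token-overlap F1 using lowercase whitespace tokens (an assumed tokenizer)."""
    out_tokens = output.lower().split()
    exp_tokens = expected.lower().split()
    if not out_tokens or not exp_tokens:
        return float(out_tokens == exp_tokens)  # both empty counts as a match
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


def budgeted_score(accuracy, prompt_tokens, token_budget, penalty=0.001):
    """Accuracy minus a linear penalty for each token over the budget."""
    overage = max(0, prompt_tokens - token_budget)
    return accuracy - penalty * overage


print(round(token_f1("the cat sat", "the cat sat down"), 3))  # 0.857
print(round(budgeted_score(0.82, prompt_tokens=120, token_budget=100), 3))  # 0.8
```

A penalty term like this lets a single number rank candidates; the alternative is to filter out over-budget prompts first and rank the rest by accuracy alone.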

The APE Ecosystem

APE is a research concept. Zhou et al. released a demo and the code for the original paper. Several practical tools extend similar ideas:

| Tool | Approach | Key Feature |
|---|---|---|
| DSPy | Compiler-style optimization | Automatically tunes prompts and few-shot examples; multi-step pipeline optimization |
| OPRO (Google DeepMind) | LLM-driven optimization | Meta-prompt that iteratively suggests improved prompts based on score history |
| PromptBench | Systematic evaluation framework | Benchmarks prompt robustness across models, adversarial perturbations |
| AutoPrompt (Shin et al. 2020) | Gradient-guided search | Uses token gradients to find optimal trigger words |
| Prefix/Prompt Tuning | Continuous soft prompts | Learns prompt embeddings via backpropagation (requires training data) |

DSPy is arguably the most mature of these for production use. It treats prompt engineering as a compiler problem: you write a program signature, and DSPy optimizes the prompt and few-shot examples automatically.
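
Of the approaches listed above, OPRO's score-history meta-prompt is the easiest to sketch without any library. The snippet below only builds the meta-prompt text; build_meta_prompt, its parameters, and the exact wording are hypothetical, and the ascending sort (best prompt last) is a design choice of this sketch rather than a confirmed detail of any particular implementation.

```python
def build_meta_prompt(history, task_description, max_shown=5):
    """Assemble a meta-prompt from (prompt, score) history.

    Keeps the max_shown highest-scoring prompts and lists them in ascending
    score order, so the strongest one sits closest to the final request.
    """
    top = sorted(history, key=lambda pair: pair[1])[-max_shown:]
    lines = [
        f"Task: {task_description}",
        "",
        "Previous instructions and their scores:",
    ]
    for prompt, score in top:
        lines.append(f"text: {prompt}")
        lines.append(f"score: {score:.1f}")
        lines.append("")
    lines.append(
        "Write a new instruction that differs from the ones above "
        "and achieves a higher score."
    )
    return "\n".join(lines)


history = [
    ("Solve the problem.", 10.0),
    ("Let's think step by step.", 40.7),
    ("Let's work this out in a step by step way to be sure we have the right answer.", 43.0),
]
print(build_meta_prompt(history, "grade-school math word problems", max_shown=2))
```

Each optimization round would send this text to the LLM, score the returned instruction on labeled examples, append the pair to history, and rebuild the meta-prompt.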

When APE Wins

APE beats human-written prompts when:

  • You have a clear evaluation metric (accuracy, F1, exact match)
  • You have labeled examples to score candidates against
  • The task is well-defined and narrow in scope
  • You're optimizing for a specific model (prompts that work on GPT-4 may not work on Claude)

Human-written prompts win when:

  • You have deep domain expertise the LLM lacks
  • The task requires nuanced judgment (legal, medical)
  • You're designing for safety/guardrails the LLM won't self-impose
  • You need creative/stylistic control the LLM can't evaluate objectively
  • You have zero labeled data (APE needs examples for scoring)

The Meta-Prompting Paradox

Who prompts the prompt engineer? APE needs:

  1. Demonstration examples — a human selects these
  2. Scoring function — a human defines what "better" means
  3. Candidate generation prompt — a human designs the meta-prompt that generates candidates
  4. Selection criteria — a human decides between accuracy, cost, latency tradeoffs

APE automates the search but not the judgment. The human role shifts from writing prompts to defining success criteria and curating evaluation data.

Recursive APE (The Rabbit Hole)

Level 0: Human writes prompts
Level 1: APE optimizes prompts
Level 2: APE optimizes the meta-prompt that optimizes prompts
Level 3: APE optimizes the scoring function that evaluates prompts
...

In practice, stop at Level 1. The meta-prompt and scoring function are better designed by humans who understand the task context.

Cost Analysis

APE is computationally expensive but a one-time cost:

| Stage | Cost | Notes |
|---|---|---|
| Generate candidates | N × 1 call | N=10-50 typically, temperature > 0 for diversity |
| Score candidates | N × M calls | M = number of eval examples, 50-200 typical |
| Refinement (optional) | K × (gen + score) | K iterations, diminishing returns after 2-3 |

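Total: roughly 500 to 10,000 LLM calls for one optimization pass at the ranges above (N × M scoring calls plus N generation calls), and more with refinement. Worth it if the prompt will be used thousands of times.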
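
The call count follows directly from the cost table. A small estimator, assuming each refinement round costs the same as the first pass (one generation call per candidate plus one scoring call per candidate per eval example); estimate_calls is a hypothetical helper name.

```python
def estimate_calls(num_candidates, num_eval, refine_rounds=0):
    """Count LLM calls for one APE run.

    Assumes every refinement round repeats a full generate-and-score pass.
    """
    per_pass = num_candidates + num_candidates * num_eval  # generate + score
    return per_pass * (1 + refine_rounds)


print(estimate_calls(10, 50))                    # 510
print(estimate_calls(50, 200))                   # 10050
print(estimate_calls(20, 100, refine_rounds=2))  # 6060
```

Multiplying the result by an average price per call gives a budget figure to weigh against how often the final prompt will run.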

Comparison With Other Prompt Optimization

| Method | Mechanism | Requires Labeled Data? | Cost Per Optimization |
|---|---|---|---|
| APE | LLM generates + scores candidates | Yes (for scoring) | High (N×M calls) |
| DSPy | Compiler optimization with optimizers (formerly called teleprompters) | Yes (training set) | Medium (compiles once) |
| OPRO | Iterative meta-prompt refinement | Yes (candidates are scored on labeled examples) | Medium |
| Manual iteration | Human trial and error | No (human judgment) | Low compute, high human time |
| Gradient-based | Token-level gradient optimization | Yes (training data) | Medium (many gradient steps; needs access to model weights) |

When to Skip APE

  • Your prompt is already performing at >95% of the theoretical ceiling
  • The task changes frequently (optimization cost > benefit)
  • You have fewer than 20 labeled examples for scoring
  • The prompt needs human safety review before deployment (APE can't guarantee safe outputs)