Friday, December 17, 1915

Evaluating Prompt Quality: Build an Eval Harness in Python

Most people iterate on prompts by feel. Read the output, decide it "looks better," ship it. Sometimes that works. Sometimes you ship a prompt that's worse than the old one and don't realize it for a week.

Measurement replaces opinion with data. An eval harness scores your prompt on accuracy, relevance, and faithfulness — then gives you a number to compare against the baseline. No gut feelings, no "seems better."

This post builds a complete eval harness in Python (~60 lines). By the end, you'll have a script that scores prompt outputs and runs A/B tests to tell you whether your new prompt actually beats the old one.

What to Measure

Not every metric applies to every task. Pick the ones that match your use case:

  • Accuracy — is the response factually correct? Use for tasks with verifiable answers: math, factual QA, classification. The judge checks whether the response matches what's known to be true.
  • Relevance — does the response address the question? Use for open-ended tasks: creative writing, analysis, recommendations. The judge checks whether the model stayed on topic.
  • Faithfulness — does the response stick to provided context? Use for RAG pipelines and grounded generation. The judge checks whether the model made things up beyond the source material.

A single metric is enough for most evals. If you're testing a fact-checking prompt, accuracy is the one that matters. If you're testing a creative writing prompt, relevance is. Don't ask the judge to score all three if only one drives user satisfaction.

The Judge Prompt

LLM-as-judge: call a second model to score the output of the first. The judge prompt is the load-bearing piece — vague instructions produce noisy, inconsistent scores.

JUDGE_PROMPT = """You are an evaluator scoring AI responses to a question.
Your job is to be critical and precise. Do not return high scores by default; only
give high scores to responses that genuinely deserve them.

Evaluate this response on the following metric:
- {metric}: {definition}

Question:
{question}

Response to evaluate:
{response}

Rate the response from 1 (worst) to 5 (best) on this single metric.
Return ONLY valid JSON with the score and a one-sentence explanation.

Example format: {{"score": 4, "explanation": "The response addresses the question directly but misses a key detail about X."}}"""

Three design decisions in this prompt:

  1. Single metric per call. Scoring accuracy AND relevance AND faithfulness in one prompt conflates the scores. Three separate judge calls with one metric each produce cleaner results.
  2. "Be critical, don't default high." Without this, the judge returns 4s and 5s for mediocre responses. Explicitly instructing it to be stingy calibrates the scale.
  3. JSON output with explanation. Structured output is parsable. The one-sentence explanation lets you spot-check the judge's reasoning without manually reading every eval pair.

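Metrics are defined separately so the same judge template works for all three. The template has no slot for source material, so for a faithfulness eval the context has to be included in the question text: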

METRICS = {
    "accuracy": "Is the response factually correct and free of errors? 1=completely wrong, 5=verifiably correct in every detail.",
    "relevance": "Does the response directly address the question asked? 1=off-topic, 5=completely and precisely answers the question.",
    "faithfulness": "Does the response stay grounded in the provided source material without fabricating claims? 1=contradicts or hallucinates beyond context, 5=every claim is supported by the source.",
}
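To see how the template and a metric definition combine, here is a minimal sketch that renders one judge prompt without calling any API. The shortened TEMPLATE, the DEFINITIONS dict, and the render_judge_prompt helper are illustrative stand-ins, not part of the harness. The doubled braces in the example line are what let str.format emit literal JSON braces.

```python
# Shortened stand-in for JUDGE_PROMPT; the placeholders mirror the real template.
TEMPLATE = """Evaluate this response on the following metric:
- {metric}: {definition}

Question:
{question}

Response to evaluate:
{response}

Example format: {{"score": 4, "explanation": "..."}}"""

# Hypothetical one-entry metric table, just for this demo.
DEFINITIONS = {
    "relevance": "Does the response directly address the question asked?",
}


def render_judge_prompt(metric, question, response):
    # Raise on unknown metrics instead of silently scoring something else.
    if metric not in DEFINITIONS:
        raise KeyError(f"unknown metric: {metric}")
    return TEMPLATE.format(
        metric=metric,
        definition=DEFINITIONS[metric],
        question=question,
        response=response,
    )


rendered = render_judge_prompt(
    "relevance", "What is the capital of Australia?", "Canberra."
)
print(rendered)
```

Printing the rendered prompt before spending API calls is a cheap way to catch a missing placeholder or an unescaped brace.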

Single Response Evaluation

Wire the judge into a function that takes a question, a response, and a metric:

import json
from openai import OpenAI

def evaluate_response(client, question, response, metric="accuracy"):
    definition = METRICS.get(metric, METRICS["accuracy"])
    judge_result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                metric=metric,
                definition=definition,
                question=question,
                response=response,
            )
        }]
    )

    try:
        return json.loads(judge_result.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON, or no text content at all (content is None):
        # return a safe fallback so the caller can skip this item.
        return {"score": None, "explanation": "parse_failed"}

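Temperature 0 makes judging as repeatable as the API allows: the same question and response will usually get the same score, although identical outputs are not strictly guaranteed. GPT-4o is the judge here because larger models tend to be more consistent at structured evaluation than smaller ones, whose scores drift more.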
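Judges do not always return bare JSON. A common failure is JSON wrapped in a Markdown code fence, or a score outside the 1 to 5 range. The sketch below is one possible stricter parser; parse_judge_output and its validation rules are assumptions for illustration, not part of the harness above.

```python
import json
import re

# Build the fence marker programmatically so this snippet never has to
# contain a literal run of three backticks.
FENCE = "`" * 3
FENCED_JSON = re.compile(
    r"^" + FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE + r"$", re.DOTALL
)


def parse_judge_output(raw):
    """Return {"score": int or None, "explanation": str} from raw judge text."""
    fallback = {"score": None, "explanation": "parse_failed"}
    if not isinstance(raw, str):
        return fallback
    text = raw.strip()
    # Unwrap a surrounding Markdown code fence if the judge added one.
    match = FENCED_JSON.match(text)
    if match:
        text = match.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict):
        return fallback
    score = data.get("score")
    # Accept only integer scores on the 1-5 scale (bool is excluded explicitly).
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        return fallback
    return {"score": score, "explanation": str(data.get("explanation", ""))}


print(parse_judge_output('{"score": 4, "explanation": "Mostly correct."}'))
```

Rejecting out-of-range scores keeps one malformed judge reply from skewing an average computed over only a handful of items.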

A/B Testing

With the judge set up, you can A/B test two prompt variants. Score each on the same eval set and compare:

def ab_test(client, prompt_a, prompt_b, eval_set, metric="accuracy"):
    results_a, results_b = [], []

    for item in eval_set:
        gen_a = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt_a.format(question=item["question"])}]
        ).choices[0].message.content

        gen_b = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt_b.format(question=item["question"])}]
        ).choices[0].message.content

        score_a = evaluate_response(client, item["question"], gen_a, metric)
        score_b = evaluate_response(client, item["question"], gen_b, metric)

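        # Keep an item only when both judge calls parsed, so the two lists
        # stay aligned item by item for the win counts below.
        if score_a.get("score") is not None and score_b.get("score") is not None:
            results_a.append(score_a["score"])
            results_b.append(score_b["score"])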

    avg_a = sum(results_a) / len(results_a) if results_a else 0
    avg_b = sum(results_b) / len(results_b) if results_b else 0
    wins_a = sum(1 for a, b in zip(results_a, results_b) if a > b)
    wins_b = sum(1 for a, b in zip(results_a, results_b) if b > a)

    print(f"Metric: {metric}")
    print(f"Prompt A — avg: {avg_a:.2f}  wins: {wins_a}/{len(results_a)}")
    print(f"Prompt B — avg: {avg_b:.2f}  wins: {wins_b}/{len(results_a)}")
    return avg_a, avg_b

You can now compare any two prompt variants:

prompt_direct = """{question}"""

prompt_cot = """{question}

Think through this step by step before answering."""

ab_test(client, prompt_direct, prompt_cot, eval_set, metric="accuracy")

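The win count is often more informative than the average. A prompt can win on average yet lose on most individual items if a couple of perfect 5s outweigh many narrow losses. Win count against a baseline tracks what most users will actually experience. Ties count for neither side, so the two win counts need not add up to the number of items.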
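The gap between averages and win counts is easy to see on toy numbers. The sketch below uses made-up scores and a hypothetical summarize_pairs helper; it assumes the two lists are aligned item by item.

```python
def summarize_pairs(scores_a, scores_b):
    """Compare two equally long, item-aligned lists of 1-5 judge scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty lists of equal length")
    pairs = list(zip(scores_a, scores_b))
    return {
        "avg_a": sum(scores_a) / len(scores_a),
        "avg_b": sum(scores_b) / len(scores_b),
        "wins_a": sum(1 for a, b in pairs if a > b),
        "wins_b": sum(1 for a, b in pairs if b > a),
        "ties": sum(1 for a, b in pairs if a == b),
    }


# Made-up scores: B is slightly better on three items, A is far better on two.
toy_a = [3, 3, 3, 5, 5]
toy_b = [4, 4, 4, 1, 1]
summary = summarize_pairs(toy_a, toy_b)
print(summary)
```

Here prompt A has the higher average, while prompt B wins three of the five items, which is the situation the paragraph above warns about.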

The Complete Harness

Here's the full script. Save it as eval_harness.py, set the OPENAI_API_KEY environment variable (OpenAI() reads it by default), and drop in your eval set:

import json
from openai import OpenAI

JUDGE_PROMPT = """You are an evaluator scoring AI responses to a question.
Your job is to be critical and precise. Do not return high scores by default; only
give high scores to responses that genuinely deserve them.

Evaluate this response on the following metric:
- {metric}: {definition}

Question:
{question}

Response to evaluate:
{response}

Rate the response from 1 (worst) to 5 (best) on this single metric.
Return ONLY valid JSON with the score and a one-sentence explanation.

Example format: {{"score": 4, "explanation": "The response addresses the question directly but misses a key detail about X."}}"""

METRICS = {
    "accuracy": "Is the response factually correct and free of errors? 1=completely wrong, 5=verifiably correct in every detail.",
    "relevance": "Does the response directly address the question asked? 1=off-topic, 5=completely and precisely answers the question.",
    "faithfulness": "Does the response stay grounded in the provided source material without fabricating claims? 1=contradicts or hallucinates beyond context, 5=every claim is supported by the source.",
}

client = OpenAI()


def evaluate_response(client, question, response, metric="accuracy"):
    definition = METRICS.get(metric, METRICS["accuracy"])
    judge_result = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            metric=metric, definition=definition, question=question, response=response)}]
    )
    try:
        return json.loads(judge_result.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        # Malformed JSON or empty content: fall back so the item is skipped.
        return {"score": None, "explanation": "parse_failed"}


def ab_test(client, prompt_a, prompt_b, eval_set, metric="accuracy"):
    results_a, results_b = [], []
    for item in eval_set:
        gen_a = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt_a.format(question=item["question"])}]
        ).choices[0].message.content

        gen_b = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt_b.format(question=item["question"])}]
        ).choices[0].message.content

        score_a = evaluate_response(client, item["question"], gen_a, metric)
        score_b = evaluate_response(client, item["question"], gen_b, metric)

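        # Keep an item only when both judge calls parsed, so the two lists
        # stay aligned item by item for the win counts below.
        if score_a.get("score") is not None and score_b.get("score") is not None:
            results_a.append(score_a["score"])
            results_b.append(score_b["score"])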

    avg_a = sum(results_a) / len(results_a) if results_a else 0
    avg_b = sum(results_b) / len(results_b) if results_b else 0
    wins_a = sum(1 for a, b in zip(results_a, results_b) if a > b)
    wins_b = sum(1 for a, b in zip(results_a, results_b) if b > a)

    print(f"Metric: {metric}")
    print(f"Prompt A — avg: {avg_a:.2f}  wins: {wins_a}/{len(results_a)}")
    print(f"Prompt B — avg: {avg_b:.2f}  wins: {wins_b}/{len(results_a)}")
    return avg_a, avg_b


if __name__ == "__main__":
    eval_set = [
        {"question": "What year did the Berlin Wall fall?", "ideal_answer": "1989"},
        {"question": "Explain quantum computing in 2 sentences.", "ideal_answer": ""},
        {"question": "What is the capital of Australia?", "ideal_answer": "Canberra"},
        {"question": "How do vaccines work?", "ideal_answer": ""},
        {"question": "Convert 42 kilometers to miles.", "ideal_answer": "26.1 miles"},
        {"question": "Name 3 factors that contributed to the fall of the Roman Empire.", "ideal_answer": ""},
        {"question": "What is the difference between HTTP and HTTPS?", "ideal_answer": ""},
        {"question": "Summarize the plot of Hamlet in one paragraph.", "ideal_answer": ""},
        {"question": "Calculate 15% of 280.", "ideal_answer": "42"},
        {"question": "What are the main causes of climate change?", "ideal_answer": ""},
    ]

    prompt_a = """{question}"""
    prompt_b = """{question}

Think through this step by step before answering."""

    ab_test(client, prompt_a, prompt_b, eval_set, metric="accuracy")

Run it:

python eval_harness.py

An A/B run costs about twice as much as scoring a single prompt: for every item there is one generation call per variant plus one judge call per response, four calls in total. For this 10-item set, that's 40 API calls, roughly $0.05 at GPT-4o-mini + GPT-4o pricing. Run it weekly, catch regressions before users do. Cheap insurance.

Building Your Eval Set

The quality of your eval determines the quality of your decisions. A bad eval set tells you a worse prompt is better.

Start with real queries. Go to your production logs, pull the last 50 user questions, and pick 10-20 that span the diversity of what people actually ask. Synthetic questions over-represent the cases you thought to test. The example eval set above skews factual because the accuracy metric needs verifiable answers. For relevance and faithfulness evals, your set should include open-ended analysis, creative writing, and RAG-grounded questions.
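As a starting point for pulling candidates from logs, the sketch below deduplicates logged questions and draws a reproducible random sample in the eval-set format used above. The sample_eval_questions helper and the toy log lines are hypothetical. Random sampling does not guarantee diversity, so the result still needs manual curation.

```python
import random


def sample_eval_questions(logged_questions, k=15, seed=0):
    """Pick up to k distinct questions from a list of logged user questions."""
    seen = set()
    unique = []
    for q in logged_questions:
        key = " ".join(q.lower().split())  # normalize case and whitespace
        if key and key not in seen:
            seen.add(key)
            unique.append(q.strip())
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    picked = rng.sample(unique, min(k, len(unique)))
    # ideal_answer starts empty; fill it in by hand for factual questions.
    return [{"question": q, "ideal_answer": ""} for q in picked]


# Made-up log lines, including a near-duplicate and a blank entry.
toy_logs = [
    "What is X?",
    "what is  x?",
    "How do I reset my password?",
    "  ",
    "Why is the sky blue?",
]
eval_candidates = sample_eval_questions(toy_logs, k=10)
print(eval_candidates)
```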

Include edge cases. Add one ambiguous question ("What's the best programming language?"), one out-of-scope question ("Who's going to win the Super Bowl?"), and one adversarial question that tries to get the model to hallucinate.

Write reference answers. For factual questions (What year? How much?), write the correct answer as ideal_answer in your eval set. The starter harness does not read this field yet: the judge scores from its own internal knowledge. For production accuracy evals, extend the judge prompt to accept a {reference} field and compare responses against it, but don't add that complexity until you need it. For open-ended questions, leave the reference empty and use the relevance metric instead of accuracy.
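One way the {reference} extension could look is sketched below. The REFERENCE_JUDGE_PROMPT wording and the build_judge_prompt helper are assumptions for illustration. The helper returns None when ideal_answer is empty, so the caller can fall back to the reference-free judge.

```python
# Hypothetical reference-aware variant of the judge template.
REFERENCE_JUDGE_PROMPT = """You are an evaluator scoring AI responses to a question.
Be critical and precise. Only give high scores to responses that deserve them.

Metric:
- accuracy: {definition}

Question:
{question}

Reference answer (treat as ground truth):
{reference}

Response to evaluate:
{response}

Rate the response from 1 (worst) to 5 (best) against the reference.
Return ONLY valid JSON, for example: {{"score": 4, "explanation": "..."}}"""


def build_judge_prompt(item, response, definition):
    """Use the reference-aware template only when ideal_answer is non-empty."""
    reference = item.get("ideal_answer", "").strip()
    if not reference:
        return None  # caller should fall back to the reference-free judge
    return REFERENCE_JUDGE_PROMPT.format(
        definition=definition,
        question=item["question"],
        reference=reference,
        response=response,
    )


factual = {"question": "What is the capital of Australia?", "ideal_answer": "Canberra"}
open_ended = {"question": "How do vaccines work?", "ideal_answer": ""}
with_ref = build_judge_prompt(factual, "Sydney", "Is the response factually correct?")
without_ref = build_judge_prompt(open_ended, "They train the immune system.", "n/a")
print(with_ref)
```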

Don't use the same model to judge that generated. If GPT-4o-mini generated the responses, don't use GPT-4o-mini as the judge. It rates its own style higher. Use GPT-4o as the judge, or a model from a different provider. The harness already does this — generate with gpt-4o-mini, judge with gpt-4o.

When LLM-as-Judge Breaks

LLM judges inherit all the problems of the models they run on:

  • Position bias. The judge favors whichever response appears first. If you're comparing two responses in a single prompt, randomize the order. In this harness, we score each response independently — no ordering issue.
  • Length bias. Longer responses get higher scores even when the extra length is filler. The "be critical" instruction in the judge prompt partially mitigates this, but it never fully disappears.
  • Self-preference. Models tend to rate their own outputs higher than comparable outputs from other models. Use a different model family for judging than generating. If you must use the same family, use a stronger model as the judge.
  • Drift. The same judge prompt on the same response can produce different scores across calls if temperature > 0. Always set temperature=0 for the judge.
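A rough way to look for length bias is to correlate response length with judge score across one eval run. The sketch below computes a plain Pearson correlation by hand; the pearson helper and the toy numbers are illustrative assumptions. A high value is only a hint, because longer answers can also be genuinely better.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation on one side: report no correlation
    return cov / math.sqrt(var_x * var_y)


# Made-up data: word counts of four responses and the judge's scores for them.
toy_lengths = [5, 12, 40, 80]
toy_scores = [3, 3, 4, 5]
length_score_corr = pearson(toy_lengths, toy_scores)
print(f"length/score correlation: {length_score_corr:.2f}")
```

If the correlation stays high after you spot-check a few long, low-quality responses, that suggests tightening the judge prompt or adding an explicit instruction to ignore length.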

For high-stakes eval decisions — picking a prompt that goes to millions of users — spot-check a random sample of 5-10 judge scores manually. If the judge consistently overrates or underrates, recalibrate the prompt or switch to human evaluation for the final call.
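To turn a spot-check into numbers, compare the judge's scores with your own scores for the same sampled items. The sketch below is one simple way to do that; judge_vs_human and the sample scores are hypothetical. A positive mean difference suggests the judge overrates, a negative one that it underrates.

```python
def judge_vs_human(judge_scores, human_scores):
    """Summarize how judge scores differ from human scores on the same items."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        # Signed average: above 0 means the judge scores higher than the human.
        "mean_diff": sum(diffs) / len(diffs),
        # Share of items where judge and human gave exactly the same score.
        "exact_agreement": sum(1 for d in diffs if d == 0) / len(diffs),
    }


# Made-up scores for five spot-checked items.
toy_judge = [4, 5, 4, 3, 5]
toy_human = [3, 4, 4, 3, 4]
check = judge_vs_human(toy_judge, toy_human)
print(check)
```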

What You Built

  • A 3-metric eval framework (accuracy, relevance, faithfulness) with a reusable judge prompt that enforces critical scoring.
  • evaluate_response() — scores a single output on one metric with JSON-parsed results.
  • ab_test() — compares two prompt variants on an eval set, returning average scores and per-item win counts.
  • A complete runnable harness with a 10-item example eval set comparing direct vs chain-of-thought prompting.
  • The pattern extends: swap in your own eval set, change the metric, add confidence intervals, or plug it into CI to block prompt regressions automatically.
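As one possible take on the confidence-interval extension, the sketch below computes a percentile bootstrap interval for the mean per-item difference between prompt B and prompt A. The bootstrap_diff_ci function, its defaults, and the toy scores are assumptions for illustration; it expects item-aligned score lists.

```python
import random


def bootstrap_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(B - A) over item-aligned scores."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so repeated runs give the same interval
    n = len(diffs)
    # Resample items with replacement and record the mean difference each time.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_resamples)
    )
    lo_index = int((alpha / 2) * n_resamples)
    hi_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_index], means[hi_index]


# Made-up scores for a 10-item eval set.
toy_a = [3, 4, 2, 5, 3, 4, 3, 2, 4, 3]
toy_b = [4, 4, 3, 5, 4, 4, 2, 3, 5, 4]
low, high = bootstrap_diff_ci(toy_a, toy_b)
print(f"95% CI for mean(B - A): [{low:.2f}, {high:.2f}]")
```

In a CI job, one option is to fail the build only when the upper bound falls below zero, meaning the new prompt is credibly worse. With only 10 items the interval will be wide, so a larger eval set makes the gate more useful.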