Claude Data Extraction: Unstructured to Structured

Master Claude for data extraction and structuring. Prompts for turning unstructured documents into JSON, CSV, and tables — leveraging Claude's strong compliance with output formatting instructions.

January 14, 2026
Tags: Claude, Data Extraction, Structured Output, JSON, Prompt Engineering

One of Claude's strengths is how closely it follows output formatting instructions. If you tell Claude to return JSON with a specific schema, you can generally expect that JSON back: no markdown fences, no explanatory text, no creative reinterpretations. This makes Claude well suited to data extraction pipelines: unstructured documents in, structured data out, reliably.

The Extraction Prompt Pattern

Every extraction prompt needs:

  1. Source description — What kind of document Claude is reading
  2. Extraction schema — Exact structure of the desired output
  3. Field definitions — What each field means and how to populate it
  4. Edge case handling — What to do when information is missing, ambiguous, or conflicting

Basic Extraction Prompt

Extract structured data from the following [document type: invoice/resume/contract/etc.].

OUTPUT FORMAT: Valid JSON object with this exact structure:

{
  "field_name_1": "value or null",
  "field_name_2": "value or null"
}

FIELD DEFINITIONS:
- field_name: [What it represents. Where to find it in the document.
  If multiple values: which one to pick. If missing: use null.]

RULES:
- Return ONLY the JSON object. No markdown fences, no explanatory text.
- If a field is mentioned multiple times with different values, use [rule].
- If a field is ambiguous, use the most [specific/recent/authoritative] value.
- If a field is completely absent, use null (not "N/A" or empty string).
- For date fields, use ISO 8601 format (YYYY-MM-DD).
- For currency fields, use numbers without currency symbols.
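
Even with the "Return ONLY the JSON object" rule in place, a pipeline should not assume the reply parses. Below is a minimal Python sketch of a defensive parser. The function name parse_extraction and the expected_fields parameter are hypothetical, not part of any SDK. It takes the span from the first opening brace to the last closing brace, so a stray fence or a sentence of preamble does not break parsing, and it fills absent keys with None to match the prompt's null rule.

```python
import json

def parse_extraction(reply: str, expected_fields: list[str]) -> dict:
    """Parse a model reply into a dict holding exactly the expected keys.

    Hypothetical helper: `reply` is the raw text returned by the model and
    `expected_fields` lists the top-level keys from the prompt's schema.
    """
    # Keep only the outermost {...} span, ignoring any surrounding text.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])  # raises on malformed JSON
    # Absent keys become None, mirroring the prompt's "use null" rule.
    # Keys the schema did not ask for are dropped.
    return {field: data.get(field) for field in expected_fields}
```

A reply that fails here is a good candidate for a retry or for human review rather than a silent skip.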

Invoice Extraction Example

Extract structured data from this invoice:

{
  "invoice_number": "string — the invoice ID/number. Usually labeled 'Invoice #' or 'Inv No.'",
  "invoice_date": "string — date the invoice was issued, ISO 8601 format",
  "due_date": "string — payment due date, ISO 8601. If not specified, use null",
  "vendor": {
    "name": "string — company or individual name sending the invoice",
    "tax_id": "string or null — VAT/GST/EIN number if present",
    "email": "string or null",
    "address": "string or null — full address as it appears"
  },
  "client": {
    "name": "string",
    "address": "string or null"
  },
  "line_items": [
    {
      "description": "string",
      "quantity": "number",
      "unit_price": "number",
      "total": "number"
    }
  ],
  "subtotal": "number — before tax",
  "tax_rate": "number or null — as percentage (e.g., 20 not 0.20)",
  "tax_amount": "number or null",
  "total": "number — final amount due",
  "currency": "string — ISO 4217 code (USD, EUR, GBP). Default: USD if not specified"
}

RULES:
- If the invoice has a 'paid' stamp or 'PAID' watermark, add "status": "paid"
- If line item totals don't sum to the stated subtotal, flag with "discrepancy": true
- If tax amount is calculable as subtotal × tax_rate / 100 but differs from the stated amount,
  use the STATED amount and flag the discrepancy
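
The arithmetic rules above can also be re-checked in code after extraction, so the pipeline does not rely on the model's own sums. This is a sketch under stated assumptions: check_invoice and its tolerance parameter are hypothetical, line items are assumed to be pre-tax (so they are compared with subtotal), and tax_rate is a percentage as the schema specifies.

```python
from decimal import Decimal

def check_invoice(invoice: dict, tolerance: str = "0.01") -> list[str]:
    """Return discrepancy messages for an extracted invoice (empty if none)."""
    tol = Decimal(tolerance)

    def dec(value) -> Decimal:
        # Go through str() so floats such as 0.1 carry no binary noise.
        return Decimal(str(value))

    issues = []
    subtotal = dec(invoice["subtotal"])

    # Assumption: line item totals are pre-tax and should add up to subtotal.
    items_sum = sum((dec(item["total"]) for item in invoice["line_items"]),
                    Decimal("0"))
    if abs(items_sum - subtotal) > tol:
        issues.append(f"line items sum to {items_sum}, subtotal says {subtotal}")

    # tax_rate is a percentage (20, not 0.20), hence the division by 100.
    rate, stated_tax = invoice.get("tax_rate"), invoice.get("tax_amount")
    if rate is not None and stated_tax is not None:
        expected_tax = subtotal * dec(rate) / Decimal("100")
        if abs(expected_tax - dec(stated_tax)) > tol:
            issues.append(f"tax computes to {expected_tax}, invoice states {stated_tax}")
    return issues
```

Following the rule in the prompt, the stated amounts are kept as extracted; the returned messages only mark the record for attention.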

Complex Extraction Patterns

Nested Entity Extraction

From this legal document, extract all mentioned entities and their relationships:

{
  "entities": [
    {
      "name": "string — official legal name",
      "type": "corporation|individual|government_agency|trust|other",
      "role": "buyer|seller|lender|borrower|guarantor|witness|other",
      "jurisdiction": "string — state/country of incorporation or residence",
      "aliases": ["string — other names used for this entity in the document"]
    }
  ],
  "relationships": [
    {
      "from": "string — entity name (must match an entity.name exactly)",
      "to": "string — entity name",
      "type": "owns|owes|guarantees|licenses|indemnifies|other",
      "details": "string — brief description of the relationship",
      "clause_reference": "string — section/paragraph where defined"
    }
  ],
  "key_terms": [
    {
      "term": "string",
      "definition": "string — how the document defines it, or null if undefined",
      "section": "string — where it appears"
    }
  ]
}

AMBIGUITY RULES:
- If an entity is referred to by multiple names, use the first official name as 'name'
  and list others in 'aliases'
- If a relationship is implied but not explicit, include it but set a flag: "explicit": false
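
The schema requires every relationship endpoint to match an entity name exactly, which is easy to verify after the fact. The sketch below is one possible check, not a fixed recipe: resolve_relationships is a hypothetical helper that maps aliases back to the official name and separates out relationships that point at entities missing from the list.

```python
def resolve_relationships(extraction: dict) -> tuple[list[dict], list[dict]]:
    """Split relationships into (resolved, dangling).

    Resolved relationships have 'from' and 'to' rewritten to the official
    entity name; dangling ones reference a name no entity declares.
    """
    canonical: dict[str, str] = {}
    for entity in extraction.get("entities", []):
        # An official name always wins over an alias with the same text.
        canonical[entity["name"]] = entity["name"]
        for alias in entity.get("aliases", []):
            canonical.setdefault(alias, entity["name"])

    resolved, dangling = [], []
    for rel in extraction.get("relationships", []):
        src, dst = canonical.get(rel["from"]), canonical.get(rel["to"])
        if src is None or dst is None:
            dangling.append(rel)
        else:
            resolved.append({**rel, "from": src, "to": dst})
    return resolved, dangling
```

A non-empty dangling list usually means an entity was missed or a name was paraphrased, and the document is worth a second extraction pass.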

Temporal Extraction

Extract all events with their temporal information:

{
  "events": [
    {
      "description": "string — what happened",
      "date": "string or null — ISO 8601. If only year: YYYY. If year-month: YYYY-MM",
      "date_type": "exact|approximate|range_start|range_end|before|after|unknown",
      "date_context": "string — how the date was expressed ('Q3 2024', 'last Tuesday', 'sometime in spring')",
      "participants": ["string — who was involved"],
      "location": "string or null",
      "confidence": "high|medium|low — how confident are you in the extracted date?"
    }
  ]
}

TEMPORAL REASONING:
- 'Last Tuesday' relative to document date of 2024-03-15 → 2024-03-12
- 'The following quarter' after Q3 2024 → Q4 2024 (range: 2024-10-01 to 2024-12-31)
- If the document date is unknown and a relative date is used, set date to null and
  preserve the relative expression in date_context
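
The two worked examples above can be reproduced with the standard library, which is handy for spot-checking the dates a model returns. This is a sketch: the function names are hypothetical, and "last Tuesday" is read as the most recent Tuesday strictly before the document date, which matches the example but is not the only reading of the phrase.

```python
from datetime import date, timedelta

def last_weekday(reference: date, weekday: int) -> date:
    """Most recent `weekday` strictly before `reference` (Monday=0 ... Sunday=6)."""
    # If the reference day is itself that weekday, go back a full week.
    days_back = (reference.weekday() - weekday) % 7 or 7
    return reference - timedelta(days=days_back)

def following_quarter(year: int, quarter: int) -> tuple[int, int]:
    """Return (year, quarter) of the quarter after the given one."""
    return (year + 1, 1) if quarter == 4 else (year, quarter + 1)

def quarter_range(year: int, quarter: int) -> tuple[date, date]:
    """First and last calendar day of a quarter (quarter is 1 to 4)."""
    start = date(year, 3 * quarter - 2, 1)
    # The last day is one day before the next quarter starts.
    if quarter == 4:
        next_start = date(year + 1, 1, 1)
    else:
        next_start = date(year, 3 * quarter + 1, 1)
    return start, next_start - timedelta(days=1)

TUESDAY = 1
print(last_weekday(date(2024, 3, 15), TUESDAY))     # 2024-03-12
print(quarter_range(*following_quarter(2024, 3)))   # Q4 2024 start and end
```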

Batch Extraction

For extracting from multiple documents with consistent schema:

Process the following [N] documents. For EACH document, return a JSON object
with the extraction schema. Return ALL results as a JSON array:

[
  { "doc_id": 1, ...extracted fields... },
  { "doc_id": 2, ...extracted fields... }
]

If extraction fails for a document (e.g., wrong format, unreadable):
{ "doc_id": N, "error": "reason for failure" }
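
On the receiving side, a batch reply needs to be split into successes and failures, and checked for documents that were silently dropped. A minimal sketch, assuming the reply is the JSON array described above; split_batch and expected_ids are hypothetical names.

```python
import json

def split_batch(reply: str, expected_ids: list[int]) -> tuple[dict, dict, list[int]]:
    """Split a batch reply into (ok, failed, missing).

    ok and failed map doc_id to the returned object; missing lists the
    expected doc_ids that do not appear in the reply at all.
    """
    ok, failed = {}, {}
    for item in json.loads(reply):
        # Per the prompt, a failed document carries an "error" key.
        target = failed if "error" in item else ok
        target[item["doc_id"]] = item
    missing = [i for i in expected_ids if i not in ok and i not in failed]
    return ok, failed, missing
```

The missing list matters most with long batches, where an item can drop out without any error object being returned for it.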

Quality Assurance Patterns

The Verification Loop

After extraction, verify your output:

1. COUNT CHECK: Number of extracted items should match what you observed
   in the document. State: "Extracted [N] items. Document appears to contain [M]."

2. REQUIRED FIELD CHECK: For each required field, confirm it has a non-null value
   or explain why it's null: "Field [X] is null because [reason]."

3. SUSPICIOUS VALUE FLAG: If any value seems unusual (e.g., $1,000,000 for a
   coffee invoice), flag it: "SUSPICIOUS: [field] = [value]. Possible error in
   extraction or source document."

4. CONFIDENCE SCORE: Overall confidence in the extraction: [0.0 - 1.0]
   Below 0.8: list the uncertain fields and why.
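
The required field check in step 2 can be mirrored by a deterministic check in code, so a null that the model did not explain is still caught. This is a sketch, with missing_required as a hypothetical helper; nested fields are written as dotted paths such as vendor.name.

```python
def missing_required(data: dict, required: list[str]) -> list[str]:
    """Return the required dotted paths whose value is absent or null."""
    missing = []
    for path in required:
        value = data
        for key in path.split("."):
            # Walk one level down; anything that is not a dict ends the walk.
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        if value is None:
            missing.append(path)
    return missing
```

Only None counts as missing here, so legitimate values such as 0 or an empty list pass the check.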

Handling Imperfect Documents

This document may contain:
- OCR errors (scanned document)
- Handwritten annotations
- Conflicting information (e.g., two different totals)
- Missing pages or sections

When you encounter issues:
- OCR UNCERTAINTY: "The value appears to be [best guess] but could be [alternative].
  Context: [surrounding text]."
- CONFLICT: "Field [X] appears twice: value A (section 3) and value B (section 7).
  Using [choice] because [reasoning]."
- MISSING: "Section [X] appears to be missing. Fields [Y, Z] cannot be extracted."

Note: Production Pipeline Pattern. For high-volume extraction, implement a two-pass system. Pass 1: Claude extracts + self-verifies with confidence scores. Pass 2: Items with confidence < 0.9 go to a second Claude call with more explicit extraction instructions, or to human review.
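
One way to arrange the two passes is sketched below. The model calls are passed in as plain functions, so extract and strict_extract are placeholders for whatever client code you use, and each is assumed to return a dict with a numeric "confidence" key. In this arrangement a low-confidence item gets the stricter second call first and goes to human review only if it is still below the threshold.

```python
from typing import Callable

Extractor = Callable[[str], dict]

def two_pass(documents: dict[str, str],
             extract: Extractor,
             strict_extract: Extractor,
             threshold: float = 0.9) -> tuple[dict[str, dict], list[str]]:
    """Run pass 1 on every document and pass 2 on low-confidence ones.

    Returns (accepted, needs_review): accepted maps doc id to its result,
    needs_review lists doc ids that stayed below the threshold.
    """
    accepted, needs_review = {}, []
    for doc_id, text in documents.items():
        result = extract(text)                      # pass 1
        if result.get("confidence", 0.0) < threshold:
            result = strict_extract(text)           # pass 2, stricter prompt
        if result.get("confidence", 0.0) < threshold:
            needs_review.append(doc_id)             # hand off to a human
        else:
            accepted[doc_id] = result
    return accepted, needs_review
```

A result with no confidence key is treated as confidence 0.0, so malformed replies fall through to review instead of being accepted.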

Note: Schema drift danger. Claude will follow your schema exactly, even if the schema is wrong for the document. If you ask for "invoice_number" but the document calls it "Reference No.", Claude may return null rather than recognizing the different label. Include field aliases in your definitions.
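
Keeping aliases next to each field in code makes it harder to forget them when the prompt is assembled. A small sketch: render_field_definitions and the shape of the fields dict are assumptions for illustration, and the output is the FIELD DEFINITIONS block of the extraction prompt.

```python
def render_field_definitions(fields: dict[str, dict]) -> str:
    """Render the FIELD DEFINITIONS section, listing known label aliases."""
    lines = ["FIELD DEFINITIONS:"]
    for name, spec in fields.items():
        line = f"- {name}: {spec['description']}"
        aliases = spec.get("aliases", [])
        if aliases:
            quoted = ", ".join(f"'{alias}'" for alias in aliases)
            line += f" May also be labeled {quoted}."
        lines.append(line)
    return "\n".join(lines)

fields = {
    "invoice_number": {
        "description": "The invoice ID. If missing: use null.",
        "aliases": ["Invoice #", "Inv No.", "Reference No."],
    },
    "due_date": {"description": "Payment due date, ISO 8601."},
}
print(render_field_definitions(fields))
```

When a new document layout shows up with a different label, adding it to the aliases list updates every prompt built from the same definitions.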