Constitutional AI: Principle-Based Model Alignment

Define principles that govern model behavior. Learn to design constitutions, implement self-critique loops, and apply constitutional prompting for safer, more aligned outputs.

June 10, 2026
constitutional-ai, alignment, safety, principles, prompt-engineering

What Constitutional AI Is

Constitutional AI (Bai et al. 2022, Anthropic) aligns model behavior by defining a set of principles — a constitution — that governs output. The model self-critiques and revises its responses to comply with these principles.

Unlike prompt injection defense (which blocks attacks), constitutional AI is proactive self-regulation. The model polices itself.

Standard prompting:
User: "How do I make a bomb?"
Model: [Depends on training — may answer or refuse based on RLHF]

Constitutional prompting:
User: "How do I make a bomb?"
Model: [Self-critique: "This violates principle #1 (harmlessness).
          I should refuse but offer alternative assistance."]
Model (revised): "I can't help with that. Would you like information on
                  chemistry for educational purposes instead?"

How It Works

At Training Time (RLAIF)

  1. Generate responses to diverse prompts.
  2. Critique each response against the constitution.
  3. Revise responses to better comply with principles.
  4. Fine-tune the model on revised responses.
  5. Train a preference model on AI-generated comparisons: the model judges which of two responses better follows the constitution.
  6. Use RL to align the model with constitutional preferences.
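
Steps 1 through 4 can be sketched as a data-generation loop. This is a minimal sketch, not the pipeline from the paper: `generate` is a hypothetical callable standing in for any text-generation call, and the prompt wording is illustrative.

```python
from typing import Callable, Dict, List


def build_revision_dataset(
    generate: Callable[[str], str],  # hypothetical: prompt in, completion out
    prompts: List[str],
    principles: List[str],
) -> List[Dict[str, str]]:
    """Produce (prompt, revised response) pairs for supervised fine-tuning."""
    dataset = []
    for prompt in prompts:
        # Step 1: initial response
        response = generate(prompt)
        for principle in principles:
            # Step 2: critique the current response against one principle
            critique = generate(
                f"Principle: {principle}\n"
                f"Request: {prompt}\nResponse: {response}\n"
                "Critique the response against the principle."
            )
            # Step 3: revise the response using that critique
            response = generate(
                f"Request: {prompt}\nResponse: {response}\n"
                f"Critique: {critique}\n"
                "Rewrite the response to address the critique."
            )
        # The final revision becomes a fine-tuning example (input to step 4)
        dataset.append({"prompt": prompt, "completion": response})
    return dataset
```

Each prompt costs one call for the draft plus two calls per principle, so the number of principles applied per prompt drives the cost of building the dataset.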

At Inference Time (Prompt-Level)

Include the constitution in the system prompt and instruct the model to self-critique.

<constitution>
You must follow these principles in all responses:

1. Helpfulness: Provide accurate, useful information that
   directly addresses the user's request.

2. Honesty: Acknowledge uncertainty. Never fabricate sources,
   credentials, or statistics. If you don't know, say so.

3. Harmlessness: Refuse requests for illegal activities,
   dangerous instructions, or content that could cause harm.
   When refusing, explain why and offer a constructive alternative.

4. Transparency: When you make assumptions about the user's
   intent, state them explicitly. Distinguish between facts
   and opinions.

5. Privacy: Never ask for or store personal information unless
   explicitly required for the task. Don't generate realistic
   personal data as examples.
</constitution>

Before responding, review your planned response against each principle.
If any principle is violated, revise before sending.
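
A system prompt like the one above can be assembled from a list of principles rather than maintained by hand. The sketch below is one way to do it; `build_system_prompt` and `base_prompt` are hypothetical names, and the wrapper text simply mirrors the example above.

```python
from typing import List

SELF_CRITIQUE_INSTRUCTION = (
    "Before responding, review your planned response against each principle.\n"
    "If any principle is violated, revise before sending."
)


def build_system_prompt(principles: List[str], base_prompt: str = "") -> str:
    """Wrap numbered principles in <constitution> tags and append the
    self-critique instruction."""
    numbered = "\n\n".join(
        f"{i}. {text}" for i, text in enumerate(principles, start=1)
    )
    constitution = (
        "<constitution>\n"
        "You must follow these principles in all responses:\n\n"
        f"{numbered}\n"
        "</constitution>"
    )
    # Skip the base prompt if none was supplied
    parts = [p for p in (base_prompt, constitution, SELF_CRITIQUE_INSTRUCTION) if p]
    return "\n\n".join(parts)
```

Keeping the principles in a list makes it easy to add or remove one and re-test without editing the prompt text directly.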

Designing a Constitution

Principle categories:

| Category | Example Principles | When Critical |
|---|---|---|
| Helpfulness | Be accurate, be relevant, be actionable | All use cases |
| Honesty | No fabrication, acknowledge limits, cite sources | Research, legal, medical |
| Harmlessness | Refuse dangerous requests, flag bias, protect minors | Public-facing bots |
| Transparency | Explain reasoning, state assumptions, mark opinions | Enterprise, compliance |
| Privacy | Don't log PII, don't generate real data, minimize collection | Healthcare, finance |

Principles should be:

  • Specific. "Be helpful" is too vague. "Provide answers with citations and actionable next steps" is testable.
  • Prioritized. When principles conflict (honesty vs harmlessness), which wins? State the hierarchy.
  • Non-contradictory. If one principle demands detail and another demands brevity, the model has no consistent way to satisfy both, and its behavior becomes unpredictable.
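
One way to make the hierarchy explicit is to store a priority with each principle and render the constitution in that order. The sketch below is illustrative only: the `Principle` class, its fields, and the tie-break sentence are assumptions, not part of any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Principle:
    name: str
    text: str
    priority: int  # 1 = highest; wins when principles conflict


def render_constitution(principles: List[Principle]) -> str:
    """Render principles in priority order with an explicit tie-break rule."""
    priorities = [p.priority for p in principles]
    if len(set(priorities)) != len(priorities):
        # Two principles with the same rank would leave conflicts unresolved
        raise ValueError("each principle needs a unique priority")
    ordered = sorted(principles, key=lambda p: p.priority)
    lines = [f"{i}. {p.name}: {p.text}" for i, p in enumerate(ordered, start=1)]
    lines.append(
        "If principles conflict, the lower-numbered principle takes precedence."
    )
    return "\n".join(lines)
```

Rejecting duplicate priorities forces the author to decide which principle wins before the model has to.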

Self-Critique Loop

1. Model generates initial response
2. Model checks response against each principle:
   "Does this violate principle #1 (Helpfulness)?"
   "Does this violate principle #2 (Honesty)?"
   ...
3. Model identifies violations
4. Model revises response to fix violations
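5. Model delivers final response (ideally with critique hidden)

A minimal implementation, assuming a model object with a generate(prompt) method that returns text:

def constitutional_generate(model, system_prompt, constitution, user_query):
    """Generate a response, then critique and revise it in a second call."""
    # Step 1: Initial response
    initial = model.generate(f"{system_prompt}\n\nUser: {user_query}")

    # Steps 2-4: Critique and revise in one call. The user request is included
    # so the reviewer can judge helpfulness, not just harmlessness.
    critique_prompt = (
        "Review this response against the constitution:\n\n"
        f"{constitution}\n\n"
        f"User request: {user_query}\n\n"
        f"Response to review: {initial}\n\n"
        "Check the response against each principle. If any principle is "
        "violated, rewrite the response to fix all violations. "
        "Output only the final response, without the critique."
    )

    # Step 5: Only the revised response is returned; the critique stays hidden
    revised = model.generate(critique_prompt)

    return revised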
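
The function above folds critique and revision into a single call. A variant, sketched below under the same assumption of a `model.generate(prompt)` method, separates the two so the critique can be logged while only the final answer reaches the user, and skips the revision call when the critique reports no violation. The `NO_VIOLATIONS` sentinel and the `ScriptedModel` stub are hypothetical and exist only for illustration.

```python
from typing import Tuple

NO_VIOLATIONS = "NO_VIOLATIONS"  # hypothetical sentinel the critique prompt asks for


def critique_then_revise(model, constitution: str, user_query: str) -> Tuple[str, str]:
    """Return (final_response, critique); the critique is for logs, not the user."""
    initial = model.generate(f"User: {user_query}")
    critique = model.generate(
        f"{constitution}\n\nUser request: {user_query}\n"
        f"Response to review: {initial}\n\n"
        f"List each violated principle, or reply {NO_VIOLATIONS} if there are none."
    )
    if critique.strip() == NO_VIOLATIONS:
        return initial, critique  # nothing to fix; skip the extra call
    revised = model.generate(
        f"{constitution}\n\nUser request: {user_query}\n"
        f"Response: {initial}\nCritique: {critique}\n\n"
        "Rewrite the response to fix every violation. Output only the rewrite."
    )
    return revised, critique


class ScriptedModel:
    """Stub for illustration: replays canned outputs instead of calling an LLM."""

    def __init__(self, outputs):
        self.outputs = list(outputs)
        self.calls = 0

    def generate(self, prompt: str) -> str:
        self.calls += 1
        return self.outputs.pop(0)
```

The trade-off is latency: a clean response costs two calls and a flagged one costs three, in exchange for a critique that can be inspected when tuning the constitution.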

Tradeoffs

Over-constrained models become evasive. A long constitution (say, 20 detailed principles) tends to make the model refuse anything borderline.

Too strict: "You must never say anything that could possibly offend anyone."
Result: Model refuses to discuss politics, religion, health, or anything debatable.

Fix:

  • Start with 3-5 core principles. Add more only when specific failures occur.
  • Include an override: "When principles conflict, prioritize helpfulness unless harm is clear and immediate."
  • Test the constitution on normal queries, not just adversarial ones.
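
Testing on normal queries can be as simple as measuring how often the model refuses. The sketch below is a rough heuristic under stated assumptions: `respond` is a hypothetical callable wrapping the model plus constitution, and the refusal markers are illustrative rather than exhaustive.

```python
from typing import Callable, Iterable

# Illustrative phrases only; real refusals vary and need a better detector
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't")


def looks_like_refusal(response: str) -> bool:
    """Crude keyword check for refusal phrasing."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(respond: Callable[[str], str], queries: Iterable[str]) -> float:
    """Fraction of queries whose response looks like a refusal."""
    queries = list(queries)
    if not queries:
        raise ValueError("need at least one query")
    refused = sum(looks_like_refusal(respond(q)) for q in queries)
    return refused / len(queries)
```

Run it on two sets: ordinary queries and adversarial ones. A high rate on the ordinary set suggests the constitution is over-constrained; a low rate on the adversarial set suggests it is too loose.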

Under-constrained models leak harm. A constitution that's too lenient doesn't prevent anything.

Too loose: "Try to be helpful and avoid obvious problems."
Result: Model answers harmful queries with only token resistance.

Comparison: Constitutional AI vs Guardrails

| Aspect | Constitutional AI | External Guardrails |
|---|---|---|
| Where it runs | Inside the model (self-regulation) | Outside the model (filter API) |
| Latency impact | +1 LLM call for critique | ~50-200ms |
| Coverage | Only what principles cover | Only what classifiers are trained for |
| Bypass risk | Prompt injection can override principles | Classifiers can be evaded |
| Cost | +tokens for critique step | Per-request API fee |
| Best for | Aligned behavior in single-model systems | Defense-in-depth with untrusted inputs |

Use both. Constitutional principles handle day-to-day alignment. External guardrails catch explicit policy violations.

Production Pattern

def production_constitutional_pipeline(user_input):
    """Run one request through all three layers.

    Assumes input_guardrail, output_guardrail, model, system_prompt, and
    constitution are configured at module level.
    """
    # Layer 1: Input filter (block explicit violations)
    if input_guardrail.check(user_input).blocked:
        return "I can't respond to that request."

    # Layer 2: Constitutional generation
    response = constitutional_generate(
        model, system_prompt, constitution, user_input
    )

    # Layer 3: Output filter (catch what self-critique missed)
    if output_guardrail.check(response).blocked:
        return "I generated a response but it was filtered for safety."

    return response

Three layers of defense: input validation, constitutional self-regulation, output filtering.
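
The pipeline above relies on guardrail objects defined elsewhere. To exercise the control flow without a real moderation service, the sketch below passes the dependencies in as arguments and uses a keyword-based stand-in. `KeywordGuardrail`, `CheckResult`, and `run_pipeline` are hypothetical names; a production guardrail would be a trained classifier or a moderation API, not a keyword list.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class CheckResult:
    blocked: bool


class KeywordGuardrail:
    """Hypothetical stand-in for a real moderation classifier or API."""

    def __init__(self, banned: Iterable[str]):
        self.banned = [term.lower() for term in banned]

    def check(self, text: str) -> CheckResult:
        lowered = text.lower()
        return CheckResult(blocked=any(term in lowered for term in self.banned))


def run_pipeline(
    user_input: str,
    generate: Callable[[str], str],  # e.g. a wrapper around constitutional generation
    input_guardrail: KeywordGuardrail,
    output_guardrail: KeywordGuardrail,
) -> str:
    # Layer 1: input filter
    if input_guardrail.check(user_input).blocked:
        return "I can't respond to that request."
    # Layer 2: generation (constitutional self-regulation happens inside generate)
    response = generate(user_input)
    # Layer 3: output filter
    if output_guardrail.check(response).blocked:
        return "I generated a response but it was filtered for safety."
    return response
```

Passing the guardrails and the generator as arguments makes each layer replaceable in tests, so the three return paths can be checked independently.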