What Constitutional AI Is

Constitutional AI (Bai et al., 2022, Anthropic) aligns model behavior with a written set of principles, called a constitution, that governs output. The model critiques its own responses against these principles and revises them to comply.

Unlike prompt injection defense (which blocks attacks), constitutional AI is proactive self-regulation. The model polices itself.

Standard prompting:
User: "How do I make a bomb?"
Model: [Depends on training — may answer or refuse based on RLHF]

Constitutional prompting:
User: "How do I make a bomb?"
Model: [Self-critique: "This violates principle #1 (harmlessness).
          I should refuse but offer alternative assistance."]
Model (revised): "I can't help with that. Would you like information on
                  chemistry for educational purposes instead?"

How It Works

At Training Time (RLAIF)

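1. Generate responses to diverse prompts.
2. Critique each response against the constitution.
3. Revise responses to better comply with principles.
4. Fine-tune the model on revised responses.
5. Have the model compare pairs of responses against the constitution, and train a preference model on those AI-generated comparisons.
6. Use RL against that preference model to align the model with constitutional preferences.

Steps 1 to 4 form the supervised phase. Steps 5 and 6 form the reinforcement learning phase, which is where the name RLAIF (RL from AI Feedback) comes from.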
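The generate, critique, and revise steps above can be sketched as a small data-building loop. This is a minimal illustration, not the procedure from the paper: `build_revision_dataset` and the `generate` callable are hypothetical names, the prompt wording is invented, and the loop applies every principle in turn as a simplification.

```python
from typing import Callable, List, Tuple


def build_revision_dataset(
    generate: Callable[[str], str],  # hypothetical: any text-in, text-out model call
    prompts: List[str],
    principles: List[str],
) -> List[Tuple[str, str]]:
    """Return (prompt, revised_response) pairs for supervised fine-tuning."""
    dataset = []
    for prompt in prompts:
        # Initial, unrevised response.
        response = generate(prompt)
        for principle in principles:
            # Critique the current response against one principle.
            critique = generate(
                f"Prompt: {prompt}\nResponse: {response}\n"
                f"Critique the response against this principle: {principle}"
            )
            # Revise the response using that critique.
            response = generate(
                f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
                "Rewrite the response so that it addresses the critique."
            )
        dataset.append((prompt, response))
    return dataset
```

The resulting pairs are what a fine-tuning job would consume. The preference-model and RL steps need a training framework and are outside the scope of this sketch.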

At Inference Time (Prompt-Level)

Include the constitution in the system prompt and instruct the model to self-critique.

<constitution>
You must follow these principles in all responses:

1. Helpfulness: Provide accurate, useful information that
   directly addresses the user's request.

2. Honesty: Acknowledge uncertainty. Never fabricate sources,
   credentials, or statistics. If you don't know, say so.

3. Harmlessness: Refuse requests for illegal activities,
   dangerous instructions, or content that could cause harm.
   When refusing, explain why and offer a constructive alternative.

4. Transparency: When you make assumptions about the user's
   intent, state them explicitly. Distinguish between facts
   and opinions.

5. Privacy: Never ask for or store personal information unless
   explicitly required for the task. Don't generate realistic
   personal data as examples.
</constitution>

Before responding, review your planned response against each principle.
If any principle is violated, revise before sending.
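If the principles live in code or configuration rather than in a hand-edited prompt, the system prompt above can be assembled programmatically. The following is a minimal sketch; `build_system_prompt` is a hypothetical helper, and the layout simply mirrors the example prompt.

```python
from typing import List, Tuple


def build_system_prompt(principles: List[Tuple[str, str]]) -> str:
    """Render (name, text) pairs as a numbered constitution block plus the review instruction."""
    lines = [
        "<constitution>",
        "You must follow these principles in all responses:",
        "",
    ]
    for number, (name, text) in enumerate(principles, start=1):
        lines.append(f"{number}. {name}: {text}")
        lines.append("")
    lines.append("</constitution>")
    lines.append("")
    lines.append("Before responding, review your planned response against each principle.")
    lines.append("If any principle is violated, revise before sending.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_system_prompt([
        ("Honesty", "Acknowledge uncertainty. If you don't know, say so."),
        ("Privacy", "Never ask for personal information unless the task requires it."),
    ]))
```

Keeping the principles as data makes it easier to version them and to reuse the same list in the critique step.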

Designing a Constitution

Principle categories:

Category	Example Principles	When Critical
Helpfulness	Be accurate, be relevant, be actionable	All use cases
Honesty	No fabrication, acknowledge limits, cite sources	Research, legal, medical
Harmlessness	Refuse dangerous requests, flag bias, protect minors	Public-facing bots
Transparency	Explain reasoning, state assumptions, mark opinions	Enterprise, compliance
Privacy	Don't log PII, don't generate real data, minimize collection	Healthcare, finance

Principles should be:

Specific. "Be helpful" is too vague. "Provide answers with citations and actionable next steps" is testable.
Prioritized. When principles conflict (honesty vs harmlessness), which wins? State the hierarchy.
Non-contradictory. If one principle demands detail and another demands brevity, the model has no consistent way to satisfy both, and its behavior becomes unpredictable.
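One way to make the hierarchy explicit is to attach a priority to each principle and refuse to proceed when two principles share the same rank. The sketch below is illustrative only: `Principle`, `resolve_conflict`, and `render_hierarchy` are hypothetical names, and the convention that a lower number wins is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Principle:
    name: str
    rule: str
    priority: int  # assumption: a lower number wins when two principles conflict


def resolve_conflict(a: Principle, b: Principle) -> Principle:
    """Return the principle that takes precedence, or fail if the hierarchy is ambiguous."""
    if a.priority == b.priority:
        raise ValueError(f"No hierarchy defined between {a.name} and {b.name}")
    return a if a.priority < b.priority else b


def render_hierarchy(principles: List[Principle]) -> str:
    """Produce a sentence stating the precedence order, suitable for the system prompt."""
    ordered = sorted(principles, key=lambda p: p.priority)
    names = " > ".join(p.name for p in ordered)
    return f"When principles conflict, apply them in this order: {names}."


if __name__ == "__main__":
    harmlessness = Principle("Harmlessness", "Refuse dangerous instructions.", 1)
    honesty = Principle("Honesty", "Never fabricate sources.", 2)
    print(resolve_conflict(honesty, harmlessness).name)
    print(render_hierarchy([honesty, harmlessness]))
```

Raising on equal priorities surfaces an unprioritized pair at design time rather than leaving the model to guess.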

Self-Critique Loop

1. Model generates initial response
2. Model checks response against each principle:
   "Does this violate principle #1 (Helpfulness)?"
   "Does this violate principle #2 (Honesty)?"
   ...
3. Model identifies violations
4. Model revises response to fix violations
5. Model delivers final response (ideally with critique hidden)

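def constitutional_generate(model, system_prompt, constitution, user_query):
    """Generate a draft, then have the model critique and revise it in one extra call.

    `model` is assumed to expose a generate(prompt) method that returns a string.
    """
    # Step 1: Initial response
    initial = model.generate(f"{system_prompt}\n\nUser: {user_query}")

    # Step 2: Self-critique and revision. The user query is included so the
    # reviewer can judge helpfulness against the actual request.
    critique_prompt = (
        "Review this response against the constitution:\n\n"
        f"{constitution}\n\n"
        f"User query: {user_query}\n\n"
        f"Response to review: {initial}\n\n"
        "Check the response against each principle. If any principle is "
        "violated, rewrite the response to fix all violations. If none is "
        "violated, repeat the response unchanged.\n"
        "Only output the final response."
    )
    revised = model.generate(critique_prompt)

    return revised.strip()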
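The function above folds critique and revision into a single call. A variant that keeps the critique separate makes it easier to hide the critique from the user, log it, and stop early when nothing is violated. The sketch below is one possible shape under stated assumptions: `critique_then_revise`, the `generate` callable, the `NONE` sentinel, and the prompt wording are all hypothetical.

```python
from typing import Callable


def critique_then_revise(
    generate: Callable[[str], str],  # hypothetical: any text-in, text-out model call
    constitution: str,
    user_query: str,
    draft: str,
    max_rounds: int = 2,
) -> str:
    """Alternate critique and revision calls until no violation is reported."""
    for _ in range(max_rounds):
        # Critique call: the output stays internal and is never shown to the user.
        critique = generate(
            f"{constitution}\n\nUser query: {user_query}\nResponse: {draft}\n\n"
            "List each violated principle on its own line, or reply with exactly NONE."
        )
        if critique.strip().upper() == "NONE":
            break  # nothing to fix, keep the current draft
        # Revision call: only the rewritten response is kept.
        draft = generate(
            f"{constitution}\n\nUser query: {user_query}\nResponse: {draft}\n"
            f"Critique: {critique}\n\n"
            "Rewrite the response to fix every listed violation. "
            "Output only the rewritten response."
        )
    return draft
```

Each round costs up to two extra model calls, so `max_rounds` directly bounds the added latency.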

Tradeoffs

Over-constrained models become evasive. A constitution with 20 detailed principles tends to produce a model that refuses anything borderline.

Too strict: "You must never say anything that could possibly offend anyone."
Result: Model refuses to discuss politics, religion, health, or anything debatable.

Fix:

Start with 3-5 core principles. Add more only when specific failures occur.
Include an override: "When principles conflict, prioritize helpfulness unless harm is clear and immediate."
Test the constitution on normal queries, not just adversarial ones.
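Testing on normal queries can be as simple as measuring how often the model refuses requests that should be answered. The sketch below assumes a hypothetical `generate` callable and uses a crude substring heuristic to detect refusals. The marker list is illustrative and would need tuning for a real model.

```python
from typing import Callable, Iterable

# Illustrative phrases only; real refusals vary by model and prompt.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i can't assist", "i'm unable to")


def looks_like_refusal(response: str) -> bool:
    """Heuristic: treat a response containing a known refusal phrase as a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(generate: Callable[[str], str], queries: Iterable[str]) -> float:
    """Fraction of queries whose response looks like a refusal."""
    queries = list(queries)
    if not queries:
        raise ValueError("queries must not be empty")
    refused = sum(looks_like_refusal(generate(q)) for q in queries)
    return refused / len(queries)
```

Running this on a set of benign queries before and after adding a principle shows whether the change made the model more evasive.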

Under-constrained models leak harm. A constitution that's too lenient doesn't prevent anything.

Too loose: "Try to be helpful and avoid obvious problems."
Result: Model answers harmful queries with only token resistance.

Comparison: Constitutional AI vs Guardrails

Aspect	Constitutional AI	External Guardrails
Where it runs	Inside the model (self-regulation)	Outside the model (filter API)
Latency impact	+1 LLM call for critique	~50-200ms
Coverage	Only what principles cover	Only what classifiers are trained for
Bypass risk	Prompt injection can override principles	Classifiers can be evaded
Cost	+tokens for critique step	Per-request API fee
Best for	Aligned behavior in single-model systems	Defense-in-depth with untrusted inputs

Use both. Constitutional principles handle day-to-day alignment. External guardrails catch explicit policy violations.

Production Pattern

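def production_constitutional_pipeline(
    user_input, model, system_prompt, constitution, input_guardrail, output_guardrail
):
    """Run input filtering, constitutional generation, and output filtering in order.

    Both guardrails are assumed to expose check(text), returning an object
    with a boolean `blocked` attribute.
    """
    # Layer 1: Input filter (block explicit violations)
    if input_guardrail.check(user_input).blocked:
        return "I can't respond to that request."

    # Layer 2: Constitutional generation
    response = constitutional_generate(
        model, system_prompt, constitution, user_input
    )

    # Layer 3: Output filter (catch what self-critique missed)
    if output_guardrail.check(response).blocked:
        return "I generated a response but it was filtered for safety."

    return response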

Three layers of defense: input validation, constitutional self-regulation, output filtering.
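The pipeline above relies on `input_guardrail` and `output_guardrail` objects that the text does not define. For local testing, a stand-in with the same `check(...).blocked` shape can be as simple as a keyword matcher. `GuardrailResult` and `KeywordGuardrail` below are hypothetical names. A production system would more likely call a trained classifier or a moderation service.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class GuardrailResult:
    blocked: bool
    reason: str = ""


class KeywordGuardrail:
    """Toy guardrail: blocks text containing any configured term (case-insensitive)."""

    def __init__(self, blocked_terms: Iterable[str]):
        self._terms = [term.lower() for term in blocked_terms]

    def check(self, text: str) -> GuardrailResult:
        lowered = text.lower()
        for term in self._terms:
            if term in lowered:
                return GuardrailResult(blocked=True, reason=f"matched blocked term: {term}")
        return GuardrailResult(blocked=False)


if __name__ == "__main__":
    guardrail = KeywordGuardrail(["make a bomb"])
    print(guardrail.check("How do I make a bomb?"))
    print(guardrail.check("How do I bake bread?"))
```

Substring matching is easy to evade and prone to false positives, so this is only suitable for exercising the pipeline's control flow.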
