Constitutional AI: Principle-Based Model Alignment
Define principles that govern model behavior. Learn to design constitutions, implement self-critique loops, and apply constitutional prompting for safer, more aligned outputs.
What Constitutional AI Is
Constitutional AI (Bai et al. 2022, Anthropic) aligns model behavior by defining a set of principles — a constitution — that governs output. The model self-critiques and revises its responses to comply with these principles.
Unlike prompt injection defense (which blocks attacks), constitutional AI is proactive self-regulation. The model polices itself.
Standard prompting:
User: "How do I make a bomb?"
Model: [Depends on training — may answer or refuse based on RLHF]
Constitutional prompting:
User: "How do I make a bomb?"
Model: [Self-critique: "This violates principle #1 (harmlessness).
I should refuse but offer alternative assistance."]
Model (revised): "I can't help with that. Would you like information on
chemistry for educational purposes instead?"
How It Works
At Training Time (RLAIF)
- Generate responses to diverse prompts.
- Critique each response against the constitution.
- Revise responses to better comply with principles.
- Fine-tune the model on revised responses.
- Train a preference model from the critique data.
- Use RL to align the model with constitutional preferences.
At Inference Time (Prompt-Level)
Include the constitution in the system prompt and instruct the model to self-critique.
<constitution>
You must follow these principles in all responses:
1. Helpfulness: Provide accurate, useful information that
directly addresses the user's request.
2. Honesty: Acknowledge uncertainty. Never fabricate sources,
credentials, or statistics. If you don't know, say so.
3. Harmlessness: Refuse requests for illegal activities,
dangerous instructions, or content that could cause harm.
When refusing, explain why and offer a constructive alternative.
4. Transparency: When you make assumptions about the user's
intent, state them explicitly. Distinguish between facts
and opinions.
5. Privacy: Never ask for or store personal information unless
explicitly required for the task. Don't generate realistic
personal data as examples.
</constitution>
Before responding, review your planned response against each principle.
If any principle is violated, revise before sending.
Designing a Constitution
Principle categories:
| Category | Example Principles | When Critical |
|---|---|---|
| Helpfulness | Be accurate, be relevant, be actionable | All use cases |
| Honesty | No fabrication, acknowledge limits, cite sources | Research, legal, medical |
| Harmlessness | Refuse dangerous requests, flag bias, protect minors | Public-facing bots |
| Transparency | Explain reasoning, state assumptions, mark opinions | Enterprise, compliance |
| Privacy | Don't log PII, don't generate real data, minimize collection | Healthcare, finance |
Principles should be:
- Specific. "Be helpful" is too vague. "Provide answers with citations and actionable next steps" is testable.
- Prioritized. When principles conflict (honesty vs harmlessness), which wins? State the hierarchy.
- Non-contradictory. If one principle demands detail and another demands brevity, the model stalls.
Self-Critique Loop
1. Model generates initial response
2. Model checks response against each principle:
"Does this violate principle #1 (Helpfulness)?"
"Does this violate principle #2 (Honesty)?"
...
3. Model identifies violations
4. Model revises response to fix violations
5. Model delivers final response (ideally with critique hidden)
def constitutional_generate(model, system_prompt, constitution, user_query):
# Step 1: Initial response
initial = model.generate(f"{system_prompt}\n\nUser: {user_query}")
# Step 2: Self-critique
critique_prompt = f"""
Review this response against the constitution:
{constitution}
Response to review: {initial}
For each principle, note whether the response VIOLATES or SATISFIES it.
If any principle is violated, rewrite the response to fix all violations.
Only output the final revised response.
"""
revised = model.generate(critique_prompt)
return revised
Tradeoffs
Over-constrained models become evasive. A constitution with 20 detailed principles produces a model that refuses everything borderline.
Too strict: "You must never say anything that could possibly offend anyone."
Result: Model refuses to discuss politics, religion, health, or anything debatable.
Fix:
- Start with 3-5 core principles. Add more only when specific failures occur.
- Include an override: "When principles conflict, prioritize helpfulness unless harm is clear and immediate."
- Test the constitution on normal queries, not just adversarial ones.
Under-constrained models leak harm. A constitution that's too lenient doesn't prevent anything.
Too loose: "Try to be helpful and avoid obvious problems."
Result: Model answers harmful queries with only token resistance.
Comparison: Constitutional AI vs Guardrails
| Constitutional AI | External Guardrails | |
|---|---|---|
| Where it runs | Inside the model (self-regulation) | Outside the model (filter API) |
| Latency impact | +1 LLM call for critique | ~50-200ms |
| Coverage | Only what principles cover | Only what classifiers are trained for |
| Bypass risk | Prompt injection can override principles | Classifiers can be evaded |
| Cost | +tokens for critique step | Per-request API fee |
| Best for | Aligned behavior in single-model systems | Defense-in-depth with untrusted inputs |
Use both. Constitutional principles handle day-to-day alignment. External guardrails catch explicit policy violations.
Production Pattern
def production_constitutional_pipeline(user_input):
# Layer 1: Input filter (block explicit violations)
if input_guardrail.check(user_input).blocked:
return "I can't respond to that request."
# Layer 2: Constitutional generation
response = constitutional_generate(
model, system_prompt, constitution, user_input
)
# Layer 3: Output filter (catch what self-critique missed)
if output_guardrail.check(response).blocked:
return "I generated a response but it was filtered for safety."
return response
Three layers of defense: input validation, constitutional self-regulation, output filtering.
Related Articles
5 Prompt Chaining Patterns With Real Code
Build 5 production-ready prompt chains: research pipeline, code review, content creation, data extraction, and debugging. Complete Python implementations with gate checks and error handling.
Midjourney Cityscapes: Create Stunning Urban Environments with AI Prompts
Master the art of generating breathtaking cityscapes in Midjourney with our comprehensive guide. Learn architectural styles, urban planning, and atmospheric effects for stunning AI art.
Mastering Midjourney Prompts: Create Stunning Interior Spaces
Master Midjourney prompts to create stunning interior spaces. Learn to define architectural elements, lighting, materials, and composition for breathtaking interior environments.