Self-Refine: Iterative Self-Improvement
Use one LLM to generate, critique, and refine its own output in a feedback loop. Boost quality on code gen, writing, and math without external models or training data.
The Core Idea
Self-Refine (Madaan et al. 2023) makes a single LLM its own critic. Instead of accepting the first output, the model evaluates its own work, generates actionable feedback, and refines the output — all within a single prompt loop. No RL, no supervised data, no external verifier.
Standard: Prompt → LLM → Done (take it or leave it)
Self-Refine: Prompt → LLM → Output → LLM critiques → LLM refines → (repeat)
The model serves three roles: generator, critic, and refiner. The same model that made the mistake can often fix it — if you ask it to check its own work.
The FEEDBACK → REFINE Loop
Self-Refine alternates between two steps:
Step 1: FEEDBACK — "Here's your output. What's wrong with it? Be specific."
Step 2: REFINE — "Here's the feedback. Rewrite the output to address it."
Step 3: REPEAT: feed the refined output back into Step 1 until a stopping criterion is met.
What Makes Feedback Actionable
Vague feedback is useless. Self-Refine needs feedback that:
- Localizes the problem — "The sentiment is neutral due to phrases like 'good' and 'okay'" not "the review is weak"
- Gives a fix direction — "Replace neutral adjectives with strongly positive ones" not "make it better"
- Is specific to the task — Different tasks need different feedback rubrics (code: efficiency; writing: tone; math: calculation errors)
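The three properties above can be built into the critique prompt itself. The following is a minimal sketch, not the paper's prompts: the `RUBRICS` table, the dimension names, and `build_feedback_prompt` are all hypothetical names chosen for illustration.

```python
# Hypothetical per-task rubrics; the dimension wording is illustrative only.
RUBRICS = {
    "code": ["correctness", "efficiency", "readability"],
    "writing": ["tone", "repetition", "weak phrasing"],
    "math": ["calculation errors", "unjustified steps"],
}


def build_feedback_prompt(task_type: str, task: str, output: str) -> str:
    """Build a critique prompt that asks for localized, fixable feedback."""
    # Fall back to generic dimensions when the task type has no rubric.
    dimensions = RUBRICS.get(task_type, ["accuracy", "completeness"])
    bullet_list = "\n".join(f"- {d}" for d in dimensions)
    return (
        f"Task: {task}\n"
        f"Output: {output}\n\n"
        "Critique the output on these dimensions:\n"
        f"{bullet_list}\n\n"
        "For each problem, quote the exact phrase or line at fault, "
        "then state how to fix it. "
        "If a dimension has no problem, say so explicitly."
    )
```

Asking the critic to quote the offending phrase covers localization, asking for a fix covers direction, and the rubric lookup covers task specificity.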
Self-Refine in Action: Sentiment Reversal
Here's a real Self-Refine trace — rewriting a positive hotel review into a negative one:
Task: Rewrite this review to have a Negative sentiment.
Input: "The Trop is magical. The retirement community vibe adds
charm. The food court and rooms are top-notch."
── Initial output (model got it wrong — still positive) ──
"If you're looking for a good experience in Vegas, the Trop
may be the perfect place. The hallways have a charming feel
and the food court offers tasty options."
── Feedback ──
"This review is still positive because of words like 'good',
'perfect', 'charming', and 'tasty'. Replace these with
strongly negative adjectives like 'terrible', 'smelly',
'disgusting'. Add sensory details that make it unpleasant."
── Refined output (fixed) ──
"If you ever wondered where the magic of Vegas crawled into
a hole to rot, welcome to the Trop. The hallways smelled of
formaldehyde. The food court was terrible. The rooms looked
like a retirement community from Florida. Steer clear."
The feedback was actionable — it pinpointed the offending words and gave replacement direction. The refinement addressed every point.
Prompt Template
[SYSTEM]
You are a writer, editor, and critic in one system.
Step 1 — Generate:
{task_prompt}
Step 2 — Critique your output:
Rate your output on these dimensions:
- Accuracy: Did you answer the question correctly?
- Completeness: Did you cover everything asked?
- Quality: Is the reasoning sound and well-expressed?
For each issue found, state EXACTLY what is wrong and WHY.
Step 3 — Refine:
Using the critique above, rewrite your output to fix every issue.
Implementation
def self_refine(llm, task_prompt, max_iterations=3):
    """Generate, critique, and refine iteratively.

    `llm` is any object with a `generate(prompt) -> str` method.
    """
    # Step 1: Initial generation
    output = llm.generate(task_prompt)
    feedback = None

    for i in range(max_iterations):
        # Step 2: Self-critique
        feedback_prompt = f"""
Here is an output generated for the task below. Evaluate it critically.
Be specific: point to exact phrases, errors, or gaps. Don't be vague.
Task: {task_prompt}
Output: {output}
Feedback (be specific, actionable, and constructive):
"""
        feedback = llm.generate(feedback_prompt)

        # Stopping condition: check BEFORE refining. If the critique finds
        # nothing to fix, the current output is final and a refine call
        # would only burn tokens.
        if is_sufficient(feedback):
            return output, {"iterations": i, "feedback": feedback}

        # Step 3: Refine based on feedback
        refine_prompt = f"""
Task: {task_prompt}
Previous output: {output}
Feedback on previous output: {feedback}
Refined output (address every point in the feedback):
"""
        output = llm.generate(refine_prompt)

    return output, {"iterations": max_iterations, "feedback": feedback}


def is_sufficient(feedback: str) -> bool:
    """Heuristic check: does the feedback say the output is already good?"""
    # Bare "correct" is deliberately left out: it also matches "incorrect".
    sufficiency_indicators = [
        "no issues", "looks good", "no errors",
        "well done", "no changes needed",
    ]
    text = feedback.lower()
    return any(indicator in text for indicator in sufficiency_indicators)
Where Self-Refine Excels
| Task | Why Self-Refine Works | Typical Gain |
|---|---|---|
| Code generation | Model can spot bugs/efficiency issues in its own code | +15-25% |
| Writing (reviews, essays) | Model can detect tone, repetition, weak phrasing | +20-30% |
| Math reasoning | Model can catch arithmetic errors on re-read | +10-20% |
| Dialogue responses | Model can judge relevance, informativeness, engagement | +15-25% |
| Toxicity removal | Model can identify problematic language and rephrase | +25-40% |
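Ranges are indicative rather than exact. Madaan et al. 2023 evaluated GPT-3.5, ChatGPT, and GPT-4 across 7 tasks and reported roughly 20% absolute improvement on average; math reasoning gained the least, because the model rarely flags its own calculation errors without an external signal.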
Where It Fails
When the model can't recognize its own errors:
- If the model confidently produces wrong math, it will confidently critique the wrong math as correct
- Circular failures: the model thinks X is right, critiques X as right, "refines" to still-X
When the task has no clear quality signal:
- Creative writing where "good" is subjective
- Open-ended brainstorming where all ideas are valid
When the cost isn't worth it:
- Each iteration adds two calls (feedback + refine) on top of the initial generation, so a single iteration already costs about 3x the baseline
- For simple tasks where the first answer is usually right, skip it
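Circular failures at least leave a visible trace: the loop keeps producing the same text. A minimal sketch of a repeat detector follows, assuming you keep a list of every output the loop has produced; `normalize` and `find_stall` are hypothetical helper names. It only shows that the loop has stopped making progress. It cannot tell whether the repeated answer is right or wrong.

```python
import re
from typing import List, Optional


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial edits do not hide a repeat."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_stall(history: List[str]) -> Optional[int]:
    """Return the index of the first output that repeats an earlier one, else None."""
    seen = set()
    for index, output in enumerate(history):
        key = normalize(output)
        if key in seen:
            return index
        seen.add(key)
    return None
```

When `find_stall` returns an index, further iterations are unlikely to help, and the output is a candidate for an external check or human review.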
Self-Refine vs. Other Strategies
| Technique | Mechanism | External Model? | Training Data? | Best For |
|---|---|---|---|---|
| Self-Refine | Generate → critique → refine | No (single LLM) | No | Self-correctable errors |
| Self-Consistency | Multiple samples → vote | No (single LLM) | No | Reasoning with verifiable answers |
| Reflexion | Act → evaluate → retry with memory | No (single LLM) | No | Agent task recovery |
| RLHF | Human feedback → reward model → PPO | Yes (reward model) | Yes (human preferences) | Alignment and safety |
| Constitutional AI | Model critiques and revises its outputs against written principles, then is fine-tuned on the revisions | No (single LLM) | No human harm labels (principles only) | Value alignment |
Cost Analysis
| Iterations | API Calls | Relative Cost | Typical Improvement |
|---|---|---|---|
| 0 (baseline) | 1 | 1x | — |
| 1 | 3 (gen + fb + refine) | 3x | +10-15% |
| 2 | 5 | 5x | +15-20% |
| 3 | 7 | 7x | +20% (diminishing returns) |
Rule of thumb: Use 1 iteration for most tasks. Use 2-3 only when accuracy is critical and you can measure improvement objectively.
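The call counts in the table follow a simple rule: one initial generation plus two calls per iteration. The sketch below spells that out and adds a rough token estimate. The token model is an assumption: it treats task, output, and feedback sizes as constant and ignores the fixed instruction text in each prompt. Both function names are hypothetical.

```python
def api_calls(iterations: int) -> int:
    """Total LLM calls: one generation, then feedback + refine per iteration."""
    if iterations < 0:
        raise ValueError("iterations must be non-negative")
    return 1 + 2 * iterations


def estimate_tokens(iterations: int, task_tokens: int,
                    output_tokens: int, feedback_tokens: int) -> int:
    """Rough total tokens (input + output) under constant-size assumptions."""
    # Initial call reads the task and writes an output.
    generate = task_tokens + output_tokens
    # Feedback call reads task + output and writes the feedback.
    feedback = task_tokens + output_tokens + feedback_tokens
    # Refine call reads task + output + feedback and writes a new output.
    refine = task_tokens + output_tokens + feedback_tokens + output_tokens
    return generate + iterations * (feedback + refine)
```

With a 100-token task, 200-token outputs, and 100-token feedback, one iteration comes to 1300 tokens against a 300-token baseline. That is about 4.3x rather than 3x, because the later calls re-send the previous output and the feedback as input.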
Stopping Criteria
Looping forever burns tokens. Pick a stopping condition:
- Max iterations (simplest): Stop after N rounds regardless. N=1 for most cases.
- Feedback signal: Stop when feedback says "no issues found" or "looks good"
- Delta check: Stop when the refinement changes less than X% of the text
- Quality threshold: Stop when a separate scoring prompt rates output above threshold
- Human in the loop: Show each refinement, let human decide when to stop
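Two of these criteria, max iterations and the delta check, are easy to combine. The sketch below uses Python's standard `difflib` to compare outputs word by word. The 5% threshold and the names `change_fraction` and `should_stop` are illustrative assumptions, not values from the paper.

```python
from difflib import SequenceMatcher


def change_fraction(previous: str, refined: str) -> float:
    """Fraction of words that changed: 0.0 means identical, 1.0 means no overlap."""
    matcher = SequenceMatcher(None, previous.split(), refined.split(), autojunk=False)
    return 1.0 - matcher.ratio()


def should_stop(previous: str, refined: str, iteration: int,
                max_iterations: int = 3, min_change: float = 0.05) -> bool:
    """Decide whether to stop after the round with zero-based index `iteration`."""
    # Hard cap: never run more than max_iterations rounds.
    if iteration + 1 >= max_iterations:
        return True
    # Delta check: stop once the refinement barely changes the text.
    return change_fraction(previous, refined) < min_change
```

Comparing word lists rather than characters keeps the check fast on long outputs and stops a single-character edit from counting as a large change.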
Combining Self-Refine With CoT
Self-Refine works best when the initial output has clear, surface-level errors — the kind the model can spot on re-reading. For deep reasoning errors, combine with Chain-of-Thought:
Step 1: CoT generation — "Let's think step by step about {problem}"
Step 2: Self-Refine the CoT chain — "Critique each step of this reasoning"
Step 3: Refine the faulty steps
This catches both reasoning gaps (CoT) and execution errors (Self-Refine).
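The three steps can be sketched as three calls. This assumes the same `llm.generate(prompt)` interface used in the Implementation section. The prompt wording beyond the two quoted instructions, and the function name `cot_self_refine`, are illustrative.

```python
def cot_self_refine(llm, problem: str) -> str:
    """Run one pass of CoT generation, step-wise critique, and repair."""
    # Step 1: CoT generation
    reasoning = llm.generate(f"Let's think step by step about {problem}")

    # Step 2: Critique the chain step by step
    critique = llm.generate(
        "Critique each step of this reasoning. "
        "Number the steps and name every faulty one.\n\n"
        f"Problem: {problem}\n"
        f"Reasoning:\n{reasoning}"
    )

    # Step 3: Repair only the steps the critique flagged
    return llm.generate(
        "Rewrite the reasoning, fixing the steps the critique flags "
        "and keeping the steps it accepts.\n\n"
        f"Problem: {problem}\n"
        f"Reasoning:\n{reasoning}\n"
        f"Critique:\n{critique}"
    )
```

Passing the original reasoning to the repair call alongside the critique lets the model keep the sound steps instead of regenerating the whole chain.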