RAG Patterns: Retrieval-Augmented Generation
Master RAG patterns from naive to agentic. Learn retrieval strategies, prompt design, and when to use each approach for AI-powered Q&A systems.
Retrieval-Augmented Generation (RAG) combines information retrieval with LLMs to produce accurate, grounded answers. The pattern you choose (naive, advanced, agentic, or multi-hop) depends on your data complexity and quality requirements.
Naive RAG
The simplest pattern: retrieve documents, concatenate with the question, generate an answer.
Retrieved documents:
[doc_1] Django is a high-level Python web framework that encourages rapid development...
[doc_2] Flask is a micro-framework for Python with a minimal core...
[doc_3] FastAPI is a modern Python web framework based on Starlette...
Question: What are the best Python web frameworks for 2026?
Answer using only the retrieved documents. If the documents don't
contain enough information, say so explicitly. Cite your sources.
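A minimal sketch of the naive pipeline, assuming a toy in-memory corpus and word-overlap scoring in place of a real vector store. `retrieve`, `build_prompt`, and the sample `doc_4` are hypothetical; the prompt text mirrors the example above.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase word set; a crude stand-in for embeddings."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, corpus: dict[str, str], k: int = 3) -> list[tuple[str, str]]:
    """Return the k documents sharing the most words with the question (ties keep corpus order)."""
    q = tokenize(question)
    scored = sorted(corpus.items(), key=lambda item: len(q & tokenize(item[1])), reverse=True)
    return scored[:k]

def build_prompt(question: str, docs: list[tuple[str, str]]) -> str:
    """Concatenate retrieved docs and the question into a single prompt."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    return (
        f"Retrieved documents:\n{context}\n"
        f"Question: {question}\n"
        "Answer using only the retrieved documents. If the documents don't\n"
        "contain enough information, say so explicitly. Cite your sources."
    )

corpus = {
    "doc_1": "Django is a high-level Python web framework that encourages rapid development",
    "doc_2": "Flask is a micro-framework for Python with a minimal core",
    "doc_3": "FastAPI is a modern Python web framework based on Starlette",
    "doc_4": "Rust ownership rules prevent data races",  # hypothetical off-topic doc
}
question = "What are the best Python web frameworks?"
prompt = build_prompt(question, retrieve(question, corpus))
```

Nothing here checks whether retrieval actually found anything useful; whatever ranks highest goes into the prompt, which is exactly the weakness described next.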
Limitations: No query rewriting, no reranking, and no recovery when retrieval misses; the prompt can only ask the model to admit the gap. If retrieval fails, the answer fails.
When naive is sufficient: Simple FAQ, small document sets (under 100 docs), demo applications, internal tools where 80% accuracy is acceptable.
Advanced RAG
Add pre-retrieval and post-retrieval steps to improve quality.
Pre-retrieval — Query rewriting:
Original query: "How do I set it up?"
Context: User was just reading about authentication in Django REST Framework 3.14
Rewritten: "How do I set up authentication in Django REST Framework 3.14?"
Rewrite the user's question into a clear, standalone search query.
Instructions:
- Expand abbreviations and acronyms
- Resolve pronouns ("it", "that", "this") by referencing conversation context
- Add domain-specific terms that improve matching
- Output only the rewritten query
Original: {user_query}
Conversation context: {context}
Search query:
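One way to wire this template into code, sketched under the assumption that the LLM call happens elsewhere. `build_rewrite_prompt` and `clean_rewrite` are hypothetical helpers; the second guards against models that wrap the query in quotes or return nothing.

```python
REWRITE_TEMPLATE = """Rewrite the user's question into a clear, standalone search query.
Instructions:
- Expand abbreviations and acronyms
- Resolve pronouns ("it", "that", "this") by referencing conversation context
- Add domain-specific terms that improve matching
- Output only the rewritten query
Original: {user_query}
Conversation context: {context}
Search query:"""

def build_rewrite_prompt(user_query: str, context: str) -> str:
    """Fill the template's two placeholders."""
    return REWRITE_TEMPLATE.format(user_query=user_query, context=context)

def clean_rewrite(llm_output: str, fallback: str) -> str:
    """Take the first non-empty line of the reply, minus surrounding quotes; else keep the original query."""
    for line in llm_output.splitlines():
        line = line.strip().strip('"')
        if line:
            return line
    return fallback
```

Falling back to the original query keeps retrieval working even when the rewrite step returns an empty reply.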
Chunking strategies — how you split documents matters:
| Strategy | Method | Best For |
|---|---|---|
| Fixed-size | Split every N characters | Simple content, logs |
| Semantic | Split at topic boundaries | Articles, documentation |
| Recursive | Split by paragraph → sentence → word | Mixed content |
| Hierarchical | Chunk + parent document reference | Long documents needing full-context answers |
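A sketch of the recursive strategy from the table, assuming character counts as the size measure (a real pipeline would usually count tokens). The separator list is illustrative, and a separator is dropped wherever a chunk boundary falls on it.

```python
def recursive_split(text: str, max_chars: int,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse to finer ones only for oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard fixed-size cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate          # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= max_chars:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Paragraphs stay intact when they fit, and only an oversized paragraph is broken into sentences, then words.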
Chunk by section headings, not by character count.
Each chunk should be a self-contained unit of meaning.
Include the document title and section path in each chunk's metadata.
Target size: 500-1000 tokens per chunk.
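A sketch of heading-based chunking with title and section-path metadata, assuming Markdown `#` headings. The `Chunk` type and field names are hypothetical, and this does not enforce the 500-1000 token target, so oversized sections would still need further splitting.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_headings(markdown: str, title: str) -> list[Chunk]:
    """One chunk per section; metadata records the document title and heading path."""
    chunks: list[Chunk] = []
    path: list[str] = []
    lines: list[str] = []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk(body, {"title": title, "section_path": " > ".join(path)}))
        lines.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]               # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            lines.append(line)
    flush()
    return chunks
```

Storing the section path lets the prompt show where a chunk came from, even when the chunk text alone is ambiguous.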
Hybrid search: Combine vector (semantic) and keyword (BM25) retrieval for better coverage.
Search using both methods:
1. Vector search — finds semantically similar chunks
2. Keyword search — finds exact term matches
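Normalize each method's scores to a common scale (e.g., 0-1), since vector similarities and BM25 scores are not directly comparable. Then merge the results, deduplicate, and rerank by combined score.
Weight: 0.6 vector + 0.4 keyword (adjust based on your data).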
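A minimal sketch of weighted score fusion, assuming each retriever returns a `{doc_id: score}` map. Min-max normalization is one common choice; reciprocal rank fusion, which uses ranks instead of raw scores, is a frequently used alternative. Function names are hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and BM25 scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_merge(vector: dict[str, float], keyword: dict[str, float],
                 w_vector: float = 0.6, w_keyword: float = 0.4) -> list[tuple[str, float]]:
    """Union both result sets (deduplicated by doc id) and sort by weighted score."""
    v, k = min_max(vector), min_max(keyword)
    combined = {
        doc: w_vector * v.get(doc, 0.0) + w_keyword * k.get(doc, 0.0)
        for doc in v.keys() | k.keys()   # a doc missing from one method scores 0 there
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

ranked = hybrid_merge({"a": 0.9, "b": 0.7, "c": 0.5}, {"b": 12.0, "d": 8.0, "c": 2.0})
```

Here `b` wins because it ranks well in both lists, while `a` (vector only) and `d` (keyword only) each collect a single weighted contribution.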
Post-retrieval — Reranking:
Rank these documents by relevance to the query. Keep only the top 3.
Query: "Python async web frameworks"
Documents:
1. [doc_A: "Introduction to Python"] — relevance: low
2. [doc_B: "FastAPI async handlers"] — relevance: high
3. [doc_C: "Django ORM tutorial"] — relevance: medium
4. [doc_D: "AIOHTTP vs FastAPI"] — relevance: high
5. [doc_E: "Python packaging guide"] — relevance: low
Kept: doc_B, doc_D, doc_C
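The reranking step reduces to "score every candidate against the query, keep the top k". In practice the scorer would be a cross-encoder or an LLM judge; this sketch assumes a stand-in that maps the example's relevance labels to numbers. All names are hypothetical.

```python
from typing import Callable

def rerank(query: str, docs: list[str], score: Callable[[str, str], float], top_k: int = 3) -> list[str]:
    """Score each (query, doc) pair and keep the top_k; sorted() is stable, so ties keep retrieval order."""
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)
    return ranked[:top_k]

# Stand-in scorer built from the relevance labels in the example above.
LABELS = {"doc_A": "low", "doc_B": "high", "doc_C": "medium", "doc_D": "high", "doc_E": "low"}
LEVEL = {"low": 0.0, "medium": 0.5, "high": 1.0}

def label_score(query: str, doc: str) -> float:
    return LEVEL[LABELS[doc]]

kept = rerank("Python async web frameworks", list(LABELS), label_score)
```

Swapping `label_score` for a real model changes nothing else in the function.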
HyDE (Hypothetical Document Embeddings): Generate a hypothetical ideal document from the query, then use that to search.
Given the question, first generate a hypothetical document that
would perfectly answer it. Then use that document's embedding
to search for real documents.
Question: {question}
Hypothetical document:
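A sketch of the HyDE flow, assuming a toy bag-of-words vector in place of a real embedding model and a canned `fake_generate` in place of the LLM. The key step is embedding the generated answer rather than the question; all names and sample docs are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Callable

def embed(text: str) -> Counter:
    """Toy bag-of-words vector; stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def hyde_search(question: str, generate: Callable[[str], str],
                corpus: dict[str, str], k: int = 3) -> list[str]:
    """Embed an LLM-written hypothetical answer, not the raw question, and search with it."""
    prompt = ("Given the question, generate a hypothetical document that\n"
              "would perfectly answer it.\n"
              f"Question: {question}\nHypothetical document:")
    query_vec = embed(generate(prompt))
    ranked = sorted(corpus, key=lambda doc_id: cosine(query_vec, embed(corpus[doc_id])), reverse=True)
    return ranked[:k]

docs = {
    "fastapi": "FastAPI lets you write async def path operation handlers on ASGI.",
    "django_orm": "The Django ORM maps Python classes to database tables.",
    "packaging": "Python packaging uses pyproject files and wheels.",
}

def fake_generate(prompt: str) -> str:
    # Canned output so the example runs offline.
    return "FastAPI supports async def handlers running on an ASGI event loop."

top = hyde_search("How do I write non-blocking endpoints?", fake_generate, docs, k=1)
```

The hypothetical answer shares vocabulary ("async", "handlers", "ASGI") with the relevant document that the short question lacks. The trade-off is an extra LLM call per query.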
Best for: Production Q&A, documentation search, any system requiring high precision.
Agentic RAG
The agent decides when and what to retrieve, using tools dynamically. It can refine searches, try different approaches, and know when it has enough information.
You are a research assistant with access to a knowledge base.
Retrieve information only when needed.
Available tools:
- search_knowledge_base(query: string) → [documents]
Rules:
1. First, try to answer from your own knowledge
2. If unsure, search the knowledge base
3. If search results are insufficient, refine your search
4. Cite sources for any retrieved information
5. Say "I couldn't find information on that" only after at least 2 search attempts
Remember: you can search multiple times with different queries.
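The control loop behind this prompt can be sketched as follows, assuming the model's choice is exposed as a `decide` callable that returns either a search or a final answer. `Action`, `run_agent`, the knowledge-base contents, and `scripted_model` are hypothetical stand-ins; a real agent would parse tool calls from the LLM's output.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str      # "search" or "answer"
    payload: str   # the search query, or the final answer text

def run_agent(question: str,
              decide: Callable[[str, list[str]], Action],
              search_knowledge_base: Callable[[str], list[str]],
              max_searches: int = 2) -> str:
    """Let the model choose between searching and answering, within a search budget."""
    findings: list[str] = []
    searches = 0
    while True:
        action = decide(question, findings)
        if action.kind == "answer":
            return action.payload
        if searches >= max_searches:
            # Budget spent: admit the gap instead of fabricating (rule 5).
            return "I couldn't find information on that."
        searches += 1
        findings.extend(search_knowledge_base(action.payload))

# --- Hypothetical stand-ins for the knowledge base and the LLM ---
KB = {"drf throttling": ["[kb_7] DRF throttling is configured via DEFAULT_THROTTLE_RATES."]}

def search_stub(query: str) -> list[str]:
    return KB.get(query.lower(), [])

def scripted_model(queries: list[str]) -> Callable[[str, list[str]], Action]:
    """Tries each scripted query in turn and answers as soon as something is found."""
    remaining = iter(queries)
    def decide(question: str, findings: list[str]) -> Action:
        if findings:
            return Action("answer", f"Based on {findings[0]}")
        return Action("search", next(remaining, queries[-1]))
    return decide

answer = run_agent("How do I rate-limit my API?",
                   scripted_model(["rate limits", "DRF throttling"]), search_stub)
```

Enforcing the search budget in code, not only in the prompt, guarantees the loop terminates even if the model keeps asking to search.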
Self-querying retrieval: The model extracts structured filters from natural language.
From the user's question, extract:
- Search query (the core information need)
- Filters (date range, category, author, version)
- Sort order (relevance, date, popularity)
Question: "Show me the latest articles about React Server Components from 2025"
→ Query: "React Server Components"
→ Filters: {year: 2025}
→ Sort: by date descending
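A sketch of the application side of self-querying, assuming the model is asked to reply in JSON with `query`, `filters`, and `sort` keys (the key names and the `"date_desc"` label are assumptions). Only metadata filtering and sorting are shown; the `query` string would go to vector search separately.

```python
import json
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    query: str                                   # core information need, for vector search
    filters: dict = field(default_factory=dict)  # exact-match metadata filters
    sort: str = "relevance"                      # "relevance" or "date_desc" (assumed labels)

def parse_self_query(llm_reply: str) -> StructuredQuery:
    """Parse the model's JSON reply; missing keys fall back to defaults."""
    data = json.loads(llm_reply)
    return StructuredQuery(query=data["query"],
                           filters=data.get("filters", {}),
                           sort=data.get("sort", "relevance"))

def apply_filters(docs: list[dict], sq: StructuredQuery) -> list[dict]:
    """Keep docs whose metadata matches every filter, then apply the requested sort."""
    hits = [d for d in docs if all(d.get(k) == v for k, v in sq.filters.items())]
    if sq.sort == "date_desc":
        hits.sort(key=lambda d: d["date"], reverse=True)   # ISO dates sort lexicographically
    return hits

reply = '{"query": "React Server Components", "filters": {"year": 2025}, "sort": "date_desc"}'
articles = [  # hypothetical metadata records
    {"title": "A", "year": 2025, "date": "2025-03-10"},
    {"title": "B", "year": 2025, "date": "2025-09-02"},
    {"title": "C", "year": 2024, "date": "2024-11-20"},
]
sq = parse_self_query(reply)
results = apply_filters(articles, sq)
```

Applying filters as exact metadata matches, rather than hoping the embedding captures "2025", is what makes self-querying reliable for dates and versions.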
When retrieval is insufficient, the agent should adapt:
Your first search returned low-quality results. Try these strategies:
1. Simplify the query (remove jargon)
2. Use synonyms for key terms
3. Split a complex query into multiple specific searches
4. If still failing, admit the gap rather than fabricating
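These strategies can be sketched as a fallback ladder: try each query variant in turn and return `None` when every one fails, signalling that the answer should admit the gap. The synonym and jargon tables, the `" and "` splitting rule, the 0.5 score threshold, and `stub_search` are all hypothetical; a real system might ask the LLM to produce the variants instead.

```python
from typing import Callable, Optional

Result = tuple[str, float]   # (doc_id, relevance score)

# Hypothetical, hand-maintained tables.
SYNONYMS = {"rate limit": "throttle", "set up": "configure"}
JARGON = {"idempotent", "ergonomic"}

def simplify(query: str) -> str:
    """Strategy 1: drop jargon terms."""
    return " ".join(w for w in query.split() if w.lower() not in JARGON)

def with_synonyms(query: str) -> str:
    """Strategy 2: swap key terms for synonyms."""
    out = query.lower()
    for term, alt in SYNONYMS.items():
        out = out.replace(term, alt)
    return out

def split_query(query: str) -> list[str]:
    """Strategy 3: break a compound query into narrower searches."""
    return [part.strip() for part in query.split(" and ") if part.strip()]

def search_with_fallbacks(query: str, search: Callable[[str], list[Result]],
                          min_score: float = 0.5) -> Optional[tuple[str, list[Result]]]:
    """Try each variant in turn; None means every strategy failed (strategy 4: admit the gap)."""
    seen: set[str] = set()
    for q in [query, simplify(query), with_synonyms(query), *split_query(query)]:
        if q in seen:
            continue
        seen.add(q)
        good = [r for r in search(q) if r[1] >= min_score]
        if good:
            return q, good
    return None

def stub_search(q: str) -> list[Result]:
    """Hypothetical index that only scores well on the word 'throttle'."""
    return [("kb_3", 0.8)] if "throttle" in q else [("kb_9", 0.2)]
```

Returning the query that succeeded alongside the results makes it easy to log which strategy rescued the search.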
Best for: Complex research, ambiguous queries, scenarios where the optimal retrieval strategy isn't known upfront.
Multi-Hop RAG
Chain multiple retrievals where each result informs the next query. Essential for questions that require connecting information across documents.
Question: "Which organization originally developed the framework used by Instagram's backend?"
Hop 1: "What framework does Instagram use?"
→ Result: "Instagram uses Django"
Hop 2: "Which organization originally developed Django?"
→ Result: "Django was created at the Lawrence Journal-World newspaper; it is now maintained by the Django Software Foundation"
Answer: "Instagram uses Django, which was originally developed at the Lawrence Journal-World newspaper."
You need to answer a question that may require multiple searches.
Break your approach into hops:
Hop 1: Search for the initial answer
Hop 2: Use the result to formulate the next search
Hop 3+: Continue until you can answer confidently
Set a maximum of 5 hops. If you can't answer after 5 hops,
report what you found and what's still missing.
Current hop: {hop_number}
Previous findings: {findings}
Next search query:
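A sketch of the hop loop with the 5-hop cap, assuming the LLM prompt above is wrapped in a `next_query` callable that returns the next search, or `None` once the findings suffice. The stubs replay the Instagram example; all names are hypothetical.

```python
from typing import Callable, Optional

def multi_hop(question: str,
              next_query: Callable[[str, list[str]], Optional[str]],
              search: Callable[[str], str],
              max_hops: int = 5) -> tuple[list[str], bool]:
    """Chain searches until next_query returns None (enough info) or the hop budget runs out.

    Returns (findings, answered); answered=False means: report findings and what's missing.
    """
    findings: list[str] = []
    for hop in range(1, max_hops + 1):
        query = next_query(question, findings)
        if query is None:
            return findings, True
        findings.append(f"Hop {hop}: {query} -> {search(query)}")
    # Give the planner one last look at the final hop's result before giving up.
    return findings, next_query(question, findings) is None

# --- Hypothetical stand-ins for the retriever and the LLM planner ---
FACTS = {
    "What framework does Instagram use?": "Instagram uses Django",
    "Which organization originally developed Django?":
        "Django was created at the Lawrence Journal-World newspaper",
}

def lookup_stub(query: str) -> str:
    return FACTS.get(query, "no result")

def planner_stub(question: str, findings: list[str]) -> Optional[str]:
    if not findings:
        return "What framework does Instagram use?"
    if len(findings) == 1 and "Django" in findings[0]:
        return "Which organization originally developed Django?"
    return None   # two linked facts are enough to answer

findings, answered = multi_hop(
    "Which organization originally developed the framework used by Instagram's backend?",
    planner_stub, lookup_stub)
```

Each hop's result is fed back through `findings`, which is what lets the second query mention Django without the user ever naming it.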
Best for: Multi-step reasoning, entity linking, questions requiring synthesis across documents.
Handling Common RAG Failures
| Failure | Symptom | Fix |
|---|---|---|
| Irrelevant retrieval | Answer uses unrelated docs | Improve chunking, add reranking step |
| Contradictory sources | Answer contains conflicting statements | Flag contradictions in output: "Source A says X, Source B says Y" |
| Outdated information | Answers reference old versions | Include date metadata, add freshness check in prompt |
| Missing information | Answer fabricates details | Tighten "only use retrieved docs" instruction, add refusal language |
| Too many documents | Oversized prompt, truncated output | Cap retrieved chunks at 3-5, use reranking |
Citation & Attribution Strategies
Inline citation: Cite within the answer text.
Django is a high-level Python framework [1]. Flask is a lightweight micro-framework [2].
[1] Django Documentation, https://docs.djangoproject.com
[2] Flask Documentation, https://flask.palletsprojects.com
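Inline citations are easier to keep consistent when the application assigns the numbers rather than the model. A minimal sketch, assuming the answer is available as `(sentence, source_id)` pairs; `cite` and the source IDs are hypothetical.

```python
def cite(claims: list[tuple[str, str]], sources: dict[str, tuple[str, str]]) -> str:
    """Render (sentence, source_id) pairs with inline [n] markers and a numbered reference list."""
    numbers: dict[str, int] = {}
    sentences = []
    for text, source_id in claims:
        # The first use of a source assigns the next number; later uses reuse it.
        n = numbers.setdefault(source_id, len(numbers) + 1)
        sentences.append(f"{text} [{n}].")
    refs = [f"[{n}] {sources[sid][0]}, {sources[sid][1]}" for sid, n in numbers.items()]
    return " ".join(sentences) + "\n" + "\n".join(refs)

SOURCES = {
    "django_docs": ("Django Documentation", "https://docs.djangoproject.com"),
    "flask_docs": ("Flask Documentation", "https://flask.palletsprojects.com"),
}
answer = cite([("Django is a high-level Python framework", "django_docs"),
               ("Flask is a lightweight micro-framework", "flask_docs")], SOURCES)
print(answer)
```

Because numbering follows first use, a source cited twice keeps the same marker, and the reference list never contains unused entries.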
When sources conflict:
Source A states the API rate limit is 100 requests per minute.
Source B states it is 1000 requests per minute.
This appears to be a version difference. Source A is for v2.0,
Source B is for v3.0. Please verify which version you are using.
Evaluating RAG Quality
Retrieval metrics:
- Hit rate — Did we retrieve at least one relevant document?
- Mean Reciprocal Rank (MRR) — How high was the first relevant result?
- Normalized Discounted Cumulative Gain (NDCG) — Overall ranking quality
Generation metrics:
- Faithfulness — Does the answer stick to retrieved documents?
- Answer relevance — Does the answer address the question?
- Completeness — Does the answer cover all aspects?
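The three retrieval metrics can be computed from ranked result lists and sets of known-relevant document IDs. A sketch assuming binary relevance (a document is either relevant or not); function names are hypothetical.

```python
import math

def hit_rate(results: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for res, rel in zip(results, relevant) if set(res[:k]) & rel)
    return hits / len(results)

def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean over queries of 1/rank of the first relevant result (0 when none is retrieved)."""
    total = 0.0
    for res, rel in zip(results, relevant):
        for rank, doc in enumerate(res, start=1):
            if doc in rel:
                total += 1 / rank
                break
    return total / len(results)

def ndcg_at_k(res: list[str], rel: set[str], k: int) -> float:
    """Binary-relevance NDCG for one query: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1 / math.log2(i + 2) for i, doc in enumerate(res[:k]) if doc in rel)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal else 0.0

results = [["d1", "d2", "d3"], ["d4", "d5", "d6"], ["d7", "d8", "d9"]]
relevant = [{"d1"}, {"d6"}, {"x"}]
```

For this data, hit rate@3 is 2/3, MRR is (1 + 1/3 + 0) / 3 = 4/9, and the second query's NDCG@3 is 0.5 because its only relevant document sits at rank 3.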
Pattern Selection
| Pattern | Retrieval Quality | Latency | Complexity | Best For |
|---|---|---|---|---|
| Naive RAG | Low | Low | Minimal | Simple FAQ, demos |
| Advanced RAG | Medium-High | Medium | Low-Medium | Production Q&A, docs |
| Agentic RAG | High | High | Medium | Complex research |
| Multi-Hop RAG | High | High | Medium | Entity linking, synthesis |
Best Practices
- Chunk strategically — Split by section boundaries, not character count. Each chunk should be self-contained.
- Include metadata — Store source, date, version, and section path alongside each chunk.
- Set relevance thresholds — Don't retrieve low-scoring documents; they add noise.
- Handle empty results — Tell the user when nothing relevant exists. Never fabricate.
- Cite sources — Always attribute information to its source document.
- Test your retrieval — Measure hit rate on a held-out set of questions before deploying.
- Version your documents — Multiple versions of the same doc cause confusion; use version metadata to retrieve the right one.