RAG Patterns

Retrieval-Augmented Generation (RAG) combines information retrieval with LLMs to produce accurate, grounded answers. The pattern you choose (naive, advanced, agentic, or multi-hop) depends on your data complexity and quality requirements.

Naive RAG

The simplest pattern: retrieve documents, concatenate with the question, generate an answer.

Retrieved documents:
[doc_1] Django is a high-level Python web framework that encourages rapid development...
[doc_2] Flask is a micro-framework for Python with a minimal core...
[doc_3] FastAPI is a modern Python web framework based on Starlette...

Question: What are the best Python web frameworks for 2026?

Answer using only the retrieved documents. If the documents don't
contain enough information, say so explicitly. Cite your sources.
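
A minimal sketch of how the prompt above might be assembled in Python. The function name `build_naive_prompt` and the `(doc_id, text)` tuple shape are assumptions for illustration; the retrieval and generation calls are left out because they depend on your vector store and model client.

```python
def build_naive_prompt(question: str, documents: list[tuple[str, str]]) -> str:
    """Concatenate retrieved documents with the question (naive RAG)."""
    # Each document is a hypothetical (doc_id, text) pair from your retriever.
    doc_lines = [f"[{doc_id}] {text}" for doc_id, text in documents]
    return "\n".join([
        "Retrieved documents:",
        *doc_lines,
        "",
        f"Question: {question}",
        "",
        "Answer using only the retrieved documents. If the documents don't",
        "contain enough information, say so explicitly. Cite your sources.",
    ])


docs = [
    ("doc_1", "Django is a high-level Python web framework..."),
    ("doc_2", "Flask is a micro-framework for Python..."),
]
prompt = build_naive_prompt("What are the best Python web frameworks?", docs)
```

Keeping the document IDs in the prompt is what lets the model cite its sources later.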

Limitations: No query rewriting, no reranking, and no recovery when retrieval misses. The prompt can admit that the documents fall short, but it cannot search again, so if retrieval fails, the answer fails.

When naive is sufficient: Simple FAQ, small document sets (under 100 docs), demo applications, internal tools where 80% accuracy is acceptable.

Advanced RAG

Add pre-retrieval and post-retrieval steps to improve quality.

Pre-retrieval — Query rewriting:

Original query: "How do I set it up?"
Context: User was just reading about authentication in Django REST Framework
Rewritten: "How do I set up authentication in Django REST Framework?"

Rewrite the user's question into a clear, standalone search query.
Instructions:
- Expand abbreviations and acronyms
- Resolve pronouns ("it", "that", "this") by referencing conversation context
- Add domain-specific terms that improve matching
- Output only the rewritten query

Original: {user_query}
Conversation context: {context}
Search query:
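
One way to wire the rewriting prompt into a pipeline, as a rough sketch. The `llm` parameter is a hypothetical callable that takes a prompt string and returns the model's text; it stands in for whatever client you use and is not part of any specific library.

```python
from typing import Callable

REWRITE_TEMPLATE = """Rewrite the user's question into a clear, standalone search query.
Instructions:
- Expand abbreviations and acronyms
- Resolve pronouns ("it", "that", "this") by referencing conversation context
- Add domain-specific terms that improve matching
- Output only the rewritten query

Original: {user_query}
Conversation context: {context}
Search query:"""


def rewrite_query(user_query: str, context: str, llm: Callable[[str], str]) -> str:
    """Fill the template, call the model, and clean up its one-line reply."""
    prompt = REWRITE_TEMPLATE.format(user_query=user_query, context=context)
    # Models often wrap a single-line answer in whitespace or quotes.
    return llm(prompt).strip().strip('"')
```

The rewritten query is used only for retrieval; the original question should still be what the model answers.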

Chunking strategies — how you split documents matters:

Strategy	Method	Best For
Fixed-size	Split every N characters	Simple content, logs
Semantic	Split at topic boundaries	Articles, documentation
Recursive	Split by paragraph → sentence → word	Mixed content
Hierarchical	Chunk + parent document reference	Long documents needing full-context answers

Chunk by section headings, not by character count.
Each chunk should be a self-contained unit of meaning.
Include the document title and section path in each chunk's metadata.
Target size: 500-1000 tokens per chunk.
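
The heading-based approach can be sketched as follows, assuming the source documents are Markdown with `#`-style headings (an assumption; other formats need a different heading detector). The `Chunk` dataclass and `chunk_by_headings` function are illustrative names. The sketch does not enforce the 500-1000 token target; oversized sections would need a second split.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    title: str         # document title, stored as metadata
    section_path: str  # e.g. "Install > Linux"
    text: str


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(title: str, markdown: str) -> list[Chunk]:
    """Split a Markdown document into one chunk per heading section."""
    chunks: list[Chunk] = []
    path: list[str] = []  # stack of headings leading to the current section
    body: list[str] = []  # lines collected for the current section

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(title, " > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own path
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push this one.
            del path[level - 1:]
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Storing the section path with each chunk lets the retriever show where a passage came from even when the chunk text never repeats its heading.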

Hybrid search: Combine vector (semantic) and keyword (BM25) retrieval for better coverage.

Search using both methods:
1. Vector search — finds semantically similar chunks
2. Keyword search — finds exact term matches

Merge results from both, deduplicate, and rerank by combined score.
Weight: 0.6 vector + 0.4 keyword (adjust based on your data).
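
A small sketch of the merge step, assuming each retriever returns a `{doc_id: score}` mapping. Vector similarity and BM25 scores live on different scales, so this version min-max normalizes each set before applying the 0.6/0.4 weights; the normalization choice is an assumption, and rank-based fusion is a common alternative.

```python
def normalize(scores: dict[str, float]) -> dict[str, float]:
    """Min-max scale scores to [0, 1] so the two retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_merge(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.6,
    keyword_weight: float = 0.4,
) -> list[tuple[str, float]]:
    """Merge both result sets, deduplicate by doc_id, sort by combined score."""
    v, k = normalize(vector_scores), normalize(keyword_scores)
    combined = {
        # A document missing from one retriever contributes 0 for that side.
        doc_id: vector_weight * v.get(doc_id, 0.0) + keyword_weight * k.get(doc_id, 0.0)
        for doc_id in v.keys() | k.keys()
    }
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))
```

A document found by both retrievers (like one matching both meaning and exact terms) naturally rises above documents found by only one.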

Post-retrieval — Reranking:

Rank these documents by relevance to the query. Keep only the top 3.

Query: "Python async web frameworks"
Documents:
1. [doc_A: "Introduction to Python"] — relevance: low
2. [doc_B: "FastAPI async handlers"] — relevance: high
3. [doc_C: "Django ORM tutorial"] — relevance: medium
4. [doc_D: "AIOHTTP vs FastAPI"] — relevance: high
5. [doc_E: "Python packaging guide"] — relevance: low

Kept: doc_B, doc_D, doc_C
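
The selection step in the example above can be sketched like this, assuming a judge (an LLM or a cross-encoder, not shown) has already labeled each document `high`, `medium`, or `low`. The names `RELEVANCE_RANK` and `keep_top` are illustrative.

```python
RELEVANCE_RANK = {"high": 2, "medium": 1, "low": 0}


def keep_top(judged: list[tuple[str, str]], k: int = 3) -> list[str]:
    """Return the k most relevant doc ids from (doc_id, label) pairs."""
    # sorted() is stable, so documents with equal labels keep retrieval order.
    ranked = sorted(judged, key=lambda pair: RELEVANCE_RANK[pair[1]], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


judged = [
    ("doc_A", "low"),
    ("doc_B", "high"),
    ("doc_C", "medium"),
    ("doc_D", "high"),
    ("doc_E", "low"),
]
kept = keep_top(judged)  # ["doc_B", "doc_D", "doc_C"]
```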

HyDE (Hypothetical Document Embeddings): Generate a hypothetical ideal document from the query, then use that to search.

Given the question, first generate a hypothetical document that
would perfectly answer it. Then use that document's embedding
to search for real documents.

Question: {question}
Hypothetical document:
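
A toy sketch of the HyDE flow. The `generate` parameter is a hypothetical callable standing in for the model that writes the hypothetical document, and `embed` is a bag-of-words stand-in for a real embedding model; both are assumptions made so the example runs without external services.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; swap in a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    generate: Callable[[str], str],
    corpus: dict[str, str],
    top_k: int = 1,
) -> list[str]:
    """Search with the embedding of a generated answer, not of the question."""
    hypothetical = generate(question)  # step 1: write the ideal document
    query_vec = embed(hypothetical)    # step 2: embed it
    ranked = sorted(                   # step 3: rank real documents against it
        corpus,
        key=lambda doc_id: cosine(query_vec, embed(corpus[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]
```

The hypothetical document may contain wrong facts; that is acceptable because it is only used as a search key and never shown to the user.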

Best for: Production Q&A, documentation search, any system requiring high precision.

Agentic RAG

The agent decides when and what to retrieve, using tools dynamically. It can refine searches, try different approaches, and know when it has enough information.

You are a research assistant with access to a knowledge base.
Retrieve information only when needed.

Available tools:
- search_knowledge_base(query: string) → [documents]

Rules:
1. First, try to answer from your own knowledge
2. If unsure, search the knowledge base
3. If search results are insufficient, refine your search
4. Cite sources for any retrieved information
5. Say "I couldn't find information on that" only after 2 search attempts

Remember: you can search multiple times with different queries.
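
The rules above could be enforced in code rather than left entirely to the prompt. This is a rough sketch: `try_answer`, `search`, and `refine` are hypothetical callables standing in for the model and the knowledge-base tool, and `try_answer` is assumed to return `None` when the model is unsure.

```python
from typing import Callable, Optional

NOT_FOUND = "I couldn't find information on that"


def agentic_answer(
    question: str,
    try_answer: Callable[[str, list[str]], Optional[str]],
    search: Callable[[str], list[str]],
    refine: Callable[[str], str],
    max_searches: int = 2,
) -> str:
    # Rule 1: try to answer without retrieval first.
    answer = try_answer(question, [])
    if answer is not None:
        return answer

    query = question
    for _ in range(max_searches):
        documents = search(query)  # Rule 2: search when unsure
        answer = try_answer(question, documents)
        if answer is not None:
            return answer
        query = refine(query)      # Rule 3: refine and retry

    # Rule 5: give up only after the allowed number of search attempts.
    return NOT_FOUND
```

Capping the loop in code guards against an agent that keeps searching indefinitely, which a prompt instruction alone cannot guarantee.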

Self-querying retrieval: The model extracts structured filters from natural language.

From the user's question, extract:
- Search query (the core information need)
- Filters (date range, category, author, version)
- Sort order (relevance, date, popularity)

Question: "Show me the latest articles about React Server Components from 2025"
→ Query: "React Server Components"
→ Filters: {year: 2025}
→ Sort: by date descending
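
A sketch of what happens after extraction, assuming the model is asked to return its answer as JSON with `query`, `filters`, and `sort` keys (that output format is an assumption, as are the `StructuredQuery` and `apply_filters` names). Documents are plain dicts with metadata fields and ISO-formatted `date` strings.

```python
import json
from dataclasses import dataclass, field


@dataclass
class StructuredQuery:
    query: str
    filters: dict = field(default_factory=dict)
    sort_by: str = "relevance"


def parse_structured_query(raw: str) -> StructuredQuery:
    """Parse the model's JSON output into a typed query object."""
    data = json.loads(raw)
    return StructuredQuery(
        query=data["query"],
        filters=data.get("filters", {}),
        sort_by=data.get("sort", "relevance"),
    )


def apply_filters(docs: list[dict], sq: StructuredQuery) -> list[dict]:
    """Keep documents whose metadata matches every extracted filter."""
    matches = [
        d for d in docs
        if all(d.get(key) == value for key, value in sq.filters.items())
    ]
    if sq.sort_by == "date":
        # ISO dates (YYYY-MM-DD) sort correctly as strings; newest first.
        matches.sort(key=lambda d: d["date"], reverse=True)
    return matches
```

The semantic search itself would run on `sq.query`; the metadata filter only narrows which documents are eligible.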

When retrieval is insufficient, the agent should adapt:

Your first search returned low-quality results. Try these strategies:
1. Simplify the query (remove jargon)
2. Use synonyms for key terms
3. Split a complex query into multiple specific searches
4. If still failing, admit the gap rather than fabricating

Best for: Complex research, ambiguous queries, scenarios where the optimal retrieval strategy isn't known upfront.

Multi-Hop RAG

Chain multiple retrievals where each result informs the next query. Essential for questions that require connecting information across documents.

Question: "Which company developed the framework used by Instagram's backend?"

Hop 1: "What framework does Instagram use?"
→ Result: "Instagram uses Django"

Hop 2: "Which company developed Django?"
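→ Result: "Django was originally developed at the Lawrence Journal-World newspaper and is now maintained by the Django Software Foundation"

Answer: "Instagram uses Django, which was originally developed at the Lawrence Journal-World newspaper and is now maintained by the Django Software Foundation."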

You need to answer a question that may require multiple searches.
Break your approach into hops:

Hop 1: Search for the initial answer
Hop 2: Use the result to formulate the next search
Hop 3+: Continue until you can answer confidently

Set a maximum of 5 hops. If you can't answer after 5 hops,
report what you found and what's still missing.

Current hop: {hop_number}
Previous findings: {findings}
Next search query:
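
The hop loop and its 5-hop cap can be sketched as below. `next_query` is a hypothetical callable standing in for the model call that fills in "Next search query:" (returning `None` once it can answer), and `search` stands in for the retriever; both are assumptions for illustration.

```python
from typing import Callable, Optional

MAX_HOPS = 5


def multi_hop(
    question: str,
    next_query: Callable[[str, list[str]], Optional[str]],
    search: Callable[[str], str],
    max_hops: int = MAX_HOPS,
) -> list[str]:
    """Run chained retrievals; each hop sees the findings of earlier hops."""
    findings: list[str] = []
    for _ in range(max_hops):
        query = next_query(question, findings)
        if query is None:  # the planner has enough to answer
            break
        findings.append(search(query))
    # If the cap is hit, the caller reports these findings and what is missing.
    return findings


# Toy planner and knowledge base for a two-hop question.
PLAN = ["What framework does Instagram use?", "Who maintains Django?"]
KB = {
    PLAN[0]: "Instagram uses Django",
    PLAN[1]: "Django is maintained by the Django Software Foundation",
}


def scripted_planner(question: str, findings: list[str]) -> Optional[str]:
    return PLAN[len(findings)] if len(findings) < len(PLAN) else None


findings = multi_hop("Who maintains Instagram's backend framework?", scripted_planner, KB.__getitem__)
```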

Best for: Multi-step reasoning, entity linking, questions requiring synthesis across documents.

Handling Common RAG Failures

Failure	Symptom	Fix
Irrelevant retrieval	Answer uses unrelated docs	Improve chunking, add reranking step
Contradictory sources	Answer contains conflicting statements	Flag contradictions in output: "Source A says X, Source B says Y"
Outdated information	Answers reference old versions	Include date metadata, add freshness check in prompt
Missing information	Answer fabricates details	Tighten "only use retrieved docs" instruction, add refusal language
Too many documents	Oversized prompt, truncated output	Cap retrieved chunks at 3-5, use reranking

Citation & Attribution Strategies

Inline citation: Cite within the answer text.

Django is a high-level Python framework [1]. Flask is better for microservices [2].

[1] Django Documentation, https://docs.djangoproject.com
[2] Flask Documentation, https://flask.palletsprojects.com
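
If the model returns claims tagged with source keys, the numbered format above can be produced mechanically. This is a sketch under that assumption; `format_with_citations` and the `(text, source_key)` pair shape are illustrative, not a standard API.

```python
def format_with_citations(
    claims: list[tuple[str, str]], sources: dict[str, str]
) -> str:
    """Render (text, source_key) claims with [n] markers and a reference list."""
    numbers: dict[str, int] = {}  # source key -> citation number, by first use
    parts = []
    for text, key in claims:
        # Reuse the number if this source was already cited.
        n = numbers.setdefault(key, len(numbers) + 1)
        parts.append(f"{text} [{n}].")
    refs = [f"[{n}] {sources[key]}" for key, n in numbers.items()]
    return " ".join(parts) + "\n\n" + "\n".join(refs)


answer = format_with_citations(
    [
        ("Django is a high-level Python framework", "django"),
        ("Flask is better for microservices", "flask"),
    ],
    {
        "django": "Django Documentation, https://docs.djangoproject.com",
        "flask": "Flask Documentation, https://flask.palletsprojects.com",
    },
)
```

Numbering sources in code rather than trusting the model's own numbering avoids markers that point at the wrong reference.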

When sources conflict:

Source A states the API rate limit is 100 requests per minute.
Source B states it is 1000 requests per minute.

This appears to be a version difference. Source A is for v2.0,
Source B is for v3.0. Please verify which version you are using.

Evaluating RAG Quality

Retrieval metrics:

Hit rate — Did we retrieve at least one relevant document?
Mean Reciprocal Rank (MRR) — How high was the first relevant result?
Normalized Discounted Cumulative Gain (NDCG) — Overall ranking quality
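
The three retrieval metrics can be computed with a few lines of Python. In this sketch, `results` holds one ranked list of document ids per test question and `relevant` holds the matching set of correct ids; `gains` maps document ids to graded relevance for NDCG. These data shapes are assumptions for illustration.

```python
import math


def hit_rate(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Fraction of queries with at least one relevant document retrieved."""
    hits = sum(1 for ranked, rel in zip(results, relevant) if rel & set(ranked))
    return hits / len(results)


def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    """Mean of 1/rank of the first relevant result (0 when none is found)."""
    total = 0.0
    for ranked, rel in zip(results, relevant):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(results)


def ndcg(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    def dcg(values: list[float]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc_id, 0.0) for doc_id in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

Hit rate and MRR only care about the first relevant result, while NDCG rewards getting the whole ordering right.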

Generation metrics:

Faithfulness — Does the answer stick to retrieved documents?
Answer relevance — Does the answer address the question?
Completeness — Does the answer cover all aspects?

Pattern Selection

Pattern	Retrieval Quality	Latency	Complexity	Best For
Naive RAG	Low	Low	Minimal	Simple FAQ, demos
Advanced RAG	Medium-High	Medium	Low-Medium	Production Q&A, docs
Agentic RAG	High	High	Medium	Complex research
Multi-Hop RAG	High	High	Medium	Entity linking, synthesis

Best Practices

Chunk strategically — Split by section boundaries, not character count. Each chunk should be self-contained.
Include metadata — Store source, date, version, and section path alongside each chunk.
Set relevance thresholds — Don't retrieve low-scoring documents; they add noise.
Handle empty results — Tell the user when nothing relevant exists. Never fabricate.
Cite sources — Always attribute information to its source document.
Test your retrieval — Measure hit rate on a held-out set of questions before deploying.
Version your documents — Multiple versions of the same doc cause confusion; use version metadata to retrieve the right one.
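
Two of these practices, relevance thresholds and empty-result handling, can be sketched together. The 0.5 threshold is an arbitrary assumption to tune against your own score distribution; the cap of 5 follows the 3-5 chunk guidance in the failure table. Function names are illustrative.

```python
NO_RESULTS = "No relevant documents were found for this question."


def select_context(
    scored_chunks: list[tuple[str, float]],
    threshold: float = 0.5,
    max_chunks: int = 5,
) -> list[str]:
    """Keep chunks at or above the threshold, best first, capped at max_chunks."""
    kept = [(text, score) for text, score in scored_chunks if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_chunks]]


def context_or_refusal(
    scored_chunks: list[tuple[str, float]], threshold: float = 0.5
) -> str:
    """Return the context block, or an explicit refusal when nothing qualifies."""
    chunks = select_context(scored_chunks, threshold)
    return "\n".join(chunks) if chunks else NO_RESULTS
```

Returning the refusal before the model is ever called removes the chance that it fabricates an answer from low-scoring noise.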
