Agent Memory Systems

Memory transforms stateless AI calls into stateful agents. The right memory strategy depends on your context window budget, task duration, and whether the agent needs to remember across sessions.

Context Window Management

The simplest form of memory is keeping recent conversation in the context window. The challenge is fitting the relevant information into the space available.

Sliding Window: Keep the last N turns. Oldest messages are dropped first.

System: You are a helpful assistant. Keep the last 3 turns in context.

Conversation history (last 3 turns, then the current question):
User: What's the capital of France?
Assistant: Paris.
User: What's its population?
Assistant: About 2.1 million.
User: What about the metro area?
Assistant: Over 13 million.
User: What's the best time to visit?
Assistant:

Window size depends on your context budget. For a 128K model, a window of 50 turns costs roughly 15-20K tokens, assuming about 300-400 tokens per turn. For an 8K model, keep it to 5-10 turns.
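A minimal Python sketch of the sliding window, assuming a turn is one user message plus one assistant reply. The class name SlidingWindow and its methods are illustrative and not taken from any library.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the most recent `max_turns` turns (user + assistant pairs)."""

    def __init__(self, max_turns: int) -> None:
        # A deque with maxlen drops the oldest entry automatically on overflow.
        self.turns = deque(maxlen=max_turns)

    def add_turn(self, user_msg: str, assistant_msg: str) -> None:
        self.turns.append((user_msg, assistant_msg))

    def render(self) -> str:
        """Format the retained turns the way they would appear in the prompt."""
        lines = []
        for user_msg, assistant_msg in self.turns:
            lines.append(f"User: {user_msg}")
            lines.append(f"Assistant: {assistant_msg}")
        return "\n".join(lines)


window = SlidingWindow(max_turns=2)
window.add_turn("What's the capital of France?", "Paris.")
window.add_turn("What's its population?", "About 2.1 million.")
window.add_turn("What about the metro area?", "Over 13 million.")
print(window.render())  # the first turn has been dropped
```

Counting in turns rather than tokens keeps the sketch short; a production version would more likely trim by measured token count.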

Summarization: Periodically summarize older context into a condensed form.

When the conversation exceeds 10 messages:
1. Summarize the key facts and decisions from messages 1-10
2. Replace those messages with: [Summary: {summary}]
3. Continue with the current conversation

Current summary: The user is planning a trip to Paris and wants
information about attractions, transportation, and accommodation.
Target budget: Keep summaries under 500 tokens.
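The three steps above can be sketched in Python as follows. The summarize callback stands in for a model call and is an assumption; fake_summarize exists only to make the example runnable.

```python
from typing import Callable


def compact_history(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    threshold: int = 10,
) -> list[str]:
    """Once history exceeds `threshold` messages, replace the first
    `threshold` messages with a single summary entry."""
    if len(messages) <= threshold:
        return messages
    summary = summarize(messages[:threshold])
    return [f"[Summary: {summary}]"] + messages[threshold:]


def fake_summarize(old_messages: list[str]) -> str:
    # Placeholder for an LLM summarization call (hypothetical).
    return f"{len(old_messages)} earlier messages about a Paris trip"


history = [f"message {i}" for i in range(1, 13)]  # 12 messages
compacted = compact_history(history, fake_summarize)
print(compacted)
```

Enforcing the 500-token target would happen inside the summarizer, for example through an instruction in its prompt.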

Selective Retention: Keep important facts while discarding trivial details.

Maintain a "key facts" section and refresh it on a fixed cadence:
- User's name: {name}
- Current project: {project}
- Important preferences: {preferences}
- Decisions made: {decisions}
- Pending action items: {items}

Discard: greetings, confirmations, small talk, off-topic tangents.
Update the key facts section every 3 messages.

Hybrid approach: Combine a verbatim recent window with a running summary of everything older.

Maintain two memory sections:
1. Recent (last 10 messages) — verbatim, no summarization
2. Long-term summary — condensed version of everything before the window

When composing a response, use both sections as context.
Fold messages into the long-term summary as they leave the window, and re-condense the summary when it grows beyond 1000 tokens.
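One way to wire the two sections together, sketched in Python. HybridMemory and the fold callback are hypothetical names; naive_fold only concatenates text, where a real system would ask a model to condense it.

```python
from collections import deque
from typing import Callable


class HybridMemory:
    """Verbatim recent window plus a long-term summary of evicted messages."""

    def __init__(self, fold: Callable[[str, str], str], window: int = 10) -> None:
        self.recent = deque()   # verbatim messages, newest last
        self.summary = ""       # condensed form of everything older
        self.window = window
        self.fold = fold        # (current summary, evicted message) -> new summary

    def add(self, message: str) -> None:
        self.recent.append(message)
        # Anything pushed out of the window is folded into the summary.
        while len(self.recent) > self.window:
            evicted = self.recent.popleft()
            self.summary = self.fold(self.summary, evicted)

    def context(self) -> str:
        """Both sections, in the order they would be placed in the prompt."""
        parts = []
        if self.summary:
            parts.append(f"Long-term summary: {self.summary}")
        parts.extend(self.recent)
        return "\n".join(parts)


def naive_fold(summary: str, evicted: str) -> str:
    # Placeholder: concatenation instead of real summarization.
    return f"{summary} | {evicted}" if summary else evicted


memory = HybridMemory(naive_fold, window=2)
for msg in ["m1", "m2", "m3", "m4"]:
    memory.add(msg)
print(memory.context())
```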

Memory budget calculation:

Model Context	Window Size	Summary Budget	Total Budget (% of context)
8K tokens	10 turns (~4K)	500 tokens	4.5K (56%)
32K tokens	25 turns (~10K)	1K tokens	11K (34%)
128K tokens	50 turns (~20K)	2K tokens	22K (17%)
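The percentages in the table follow from simple arithmetic, reproduced here in Python. The sketch assumes 1K means 1,000 tokens; the function name memory_budget is illustrative.

```python
def memory_budget(context_tokens: int, window_tokens: int, summary_tokens: int) -> tuple[int, int]:
    """Return (total memory tokens, share of the context window in percent)."""
    total = window_tokens + summary_tokens
    return total, round(100 * total / context_tokens)


rows = [
    (8_000, 4_000, 500),
    (32_000, 10_000, 1_000),
    (128_000, 20_000, 2_000),
]
for context, window, summary in rows:
    total, pct = memory_budget(context, window, summary)
    print(f"{context:>7} context: {total:>6} memory tokens ({pct}%)")
```

Whatever share memory takes is unavailable for the system prompt, tool output, and the model's reply, which is why the small-context row is the tightest.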

RAG as Memory

Retrieve relevant information from a knowledge base when needed. Best for factual recall that doesn't fit in the context window.

Before answering, search your knowledge base for relevant information:

1. Convert the user's question into a search query
2. Retrieve the top 3 most relevant documents
3. Use the retrieved information to answer

Retrieved documents:
{document_1}
{document_2}
{document_3}

User question: {question}
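A toy Python sketch of the retrieve-then-answer flow. It ranks documents by shared words, a deliberately simple stand-in for the embedding or full-text search a real knowledge base would use. The function names and sample documents are made up for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Return up to `top_k` documents sharing at least one word with the question."""
    query_terms = tokenize(question)
    scored = [(len(query_terms & tokenize(doc)), doc) for doc in documents]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)  # highest overlap first
    return [doc for _, doc in matches[:top_k]]


def build_prompt(question: str, documents: list[str]) -> str:
    retrieved = retrieve(question, documents)
    return (
        "Retrieved documents:\n"
        + "\n".join(retrieved)
        + f"\n\nUser question: {question}"
    )


docs = [
    "Reset your password from the account settings page",
    "Invoices are emailed on the first day of each month",
    "Password rules: at least 12 characters",
    "Contact support to change your account email",
]
print(build_prompt("How do I reset my password?", docs))
```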

Real-world example — documentation chatbot with version awareness:

Tools:
- search_docs(query: string, product_version: string) → [doc_chunks]

User: "How do I set up authentication in v3?"

1. search_docs("authentication setup", "v3.0")
   → Retrieved 3 chunks about auth configuration

2. Answer using those chunks, noting the version.
   If the user asks about a different version later,
   retrieve from that version's docs instead.

Cache invalidation: RAG memory can become stale. Include freshness checks.

For each retrieved document, check its last_updated date.
If the document is older than 30 days, add a caveat:
"This information was last updated on {date}. Verify against current docs."

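The freshness check can be sketched in Python as below, assuming each retrieved document carries a last_updated date. The function name freshness_caveat is illustrative, and "older than 30 days" is read as strictly more than 30 days.

```python
from datetime import date, timedelta
from typing import Optional


def freshness_caveat(last_updated: date, today: date, max_age_days: int = 30) -> Optional[str]:
    """Return a caveat string for stale documents, or None if still fresh."""
    if today - last_updated > timedelta(days=max_age_days):
        return (
            f"This information was last updated on {last_updated.isoformat()}. "
            "Verify against current docs."
        )
    return None


today = date(2026, 5, 1)
print(freshness_caveat(date(2026, 3, 1), today))   # stale, returns the caveat
print(freshness_caveat(date(2026, 4, 20), today))  # fresh, returns None
```

Passing today in explicitly, instead of calling date.today() inside the function, keeps the check deterministic and easy to test.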
Best for: Factual Q&A, documentation assistants, customer support with knowledge bases, any scenario with more information than fits in context.

Episodic Memory

Store and recall specific past interactions — what worked, what didn't, user preferences.

You have interacted with this user before. Here are relevant past interactions:

Past session 1 (2 days ago):
- Project: React dashboard with real-time data
- Preferred: TypeScript, Zustand for state, Tailwind CSS
- Disliked: Redux, class components, verbose comments

Past session 2 (yesterday):
- Asked about chart libraries
- Recommended: Recharts for simplicity
- User chose: Nivo for animation support

Use this history to tailor your recommendations and tone.

When a user returns after a previous session:
1. Greet them by name and reference their last project
2. Ask if they want to continue where they left off
3. Suggest next steps based on previous work
4. Do not assume — ask if their priorities have changed

Session merging: When relevant facts accumulate across sessions, merge them into a single profile.

Merge new information from this session into the user's profile:
- New facts learned this session: {new_facts}
- Contradictions with stored facts: {contradictions}
- Resolve contradictions by asking the user which is correct
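A small Python sketch of the merge step, assuming the profile is a flat dictionary of facts. The function name merge_profile and the sample keys are hypothetical. Contradicted values are left untouched until the user confirms which one is correct.

```python
def merge_profile(stored: dict, new_facts: dict) -> tuple[dict, dict]:
    """Merge new facts into a profile.

    Returns (merged profile, contradictions), where contradictions maps a key
    to the pair (stored value, new value) that still needs user confirmation.
    """
    merged = dict(stored)
    contradictions = {}
    for key, value in new_facts.items():
        if key in stored and stored[key] != value:
            contradictions[key] = (stored[key], value)  # do not overwrite yet
        else:
            merged[key] = value
    return merged, contradictions


stored = {"state_library": "Zustand", "chart_library": "Recharts"}
new_facts = {"chart_library": "Nivo", "styling": "Tailwind CSS"}

profile, conflicts = merge_profile(stored, new_facts)
print(profile)
print(conflicts)  # the chart library needs a question to the user
```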

Best for: Personal assistants, coding companions, any system that benefits from knowing user history.

Persistent Memory

Long-term storage using external databases, knowledge graphs, or files. Survives across sessions and even across different agent instances.

Knowledge Graph: Store entities and their relationships.

Maintain a knowledge graph of user information:
- Entities: people, projects, preferences, facts, dates
- Relationships: "works on", "prefers", "located in", "mentioned in"

When new information arrives:
1. Extract entities and relationships from the conversation
2. Check if entities already exist in the graph
3. Update existing entities or create new ones
4. Resolve conflicts by asking the user

Graph state (current session):
- User: Bruce → works on → Prompt Genius
- User: Bruce → prefers → TypeScript
- Project: Prompt Genius → uses → Next.js
- Project: Prompt Genius → uses → Cloudflare Pages
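The graph state above can be held as a set of (subject, relation, object) triples. The Python sketch below is an in-memory illustration only; the class KnowledgeGraph is a hypothetical name, and conflict resolution and persistence are left out.

```python
class KnowledgeGraph:
    """Minimal triple store: each fact is (subject, relation, object)."""

    def __init__(self) -> None:
        self.triples = set()

    def add(self, subject: str, relation: str, obj: str) -> None:
        # A set makes re-adding a known fact a no-op.
        self.triples.add((subject, relation, obj))

    def query(self, subject=None, relation=None) -> list:
        """Return matching triples, sorted for stable output."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
        )


graph = KnowledgeGraph()
graph.add("Bruce", "works on", "Prompt Genius")
graph.add("Bruce", "prefers", "TypeScript")
graph.add("Prompt Genius", "uses", "Next.js")
graph.add("Prompt Genius", "uses", "Cloudflare Pages")
graph.add("Bruce", "prefers", "TypeScript")  # duplicate, ignored

print(graph.query(subject="Prompt Genius", relation="uses"))
```

Only the triples returned by a query need to enter the prompt, which is what keeps the per-turn token cost of a graph low.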

File-based memory: Save and load from structured files. Simpler than a graph, good for small-scale use.

You have access to a persistent memory file: {user_id}_memory.json

On session start:
1. Load {user_id}_memory.json
2. Parse the JSON into your working context

During the session, track:
- New facts learned (with timestamps)
- Changes to existing facts
- Action items created or completed
- User feedback on your responses

On session end:
1. Update {user_id}_memory.json with new information
2. Do not overwrite — merge changes carefully
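3. Keep the file small, since the whole file is loaded into context: at roughly 4 characters per token, 8KB is about 2,000 tokens, while 100KB would cost around 25K tokens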

Example schema:
{
  "user_name": "Bruce",
  "projects": [{"name": "Prompt Genius", "tech_stack": ["Next.js", "Cloudflare"]}],
  "preferences": {"verbosity": "concise", "tone": "direct"},
  "session_history": [{"date": "2026-05-01", "summary": "Worked on routing config"}],
  "action_items": [{"task": "Deploy v2.1", "status": "pending"}]
}
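A Python sketch of the load, merge, and save cycle using only the standard library. The helper names and the merge rules (append to lists without duplicates, update nested dictionaries, replace scalars) are assumptions chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_memory(path: Path) -> dict:
    """Load the memory file, or start empty if it does not exist yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def merge_memory(stored: dict, updates: dict) -> dict:
    """Merge updates into stored memory without discarding existing entries."""
    merged = dict(stored)
    for key, value in updates.items():
        current = merged.get(key)
        if isinstance(value, list) and isinstance(current, list):
            # Append new entries, skipping exact duplicates.
            merged[key] = current + [v for v in value if v not in current]
        elif isinstance(value, dict) and isinstance(current, dict):
            merged[key] = {**current, **value}
        else:
            merged[key] = value
    return merged


def save_memory(path: Path, memory: dict) -> None:
    path.write_text(json.dumps(memory, indent=2), encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "bruce_memory.json"
    save_memory(path, {
        "user_name": "Bruce",
        "preferences": {"verbosity": "concise"},
        "action_items": [{"task": "Deploy v2.1", "status": "pending"}],
    })
    session_updates = {
        "preferences": {"tone": "direct"},
        "action_items": [{"task": "Deploy v2.1", "status": "pending"}],
    }
    save_memory(path, merge_memory(load_memory(path), session_updates))
    reloaded = load_memory(path)

print(reloaded)
```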

Best for: Long-term assistants, personal knowledge bases, applications where users expect the system to remember them across days or weeks.

Privacy & Security

Memory storage introduces data risks. Address them explicitly.

Data retention policy - Define how long memories are stored. Provide a "forget me" command.
User consent - Ask before storing personal information. "Would you like me to remember your preferred framework?"
Memory scrubbing - Allow users to view, edit, or delete stored memories.
Cross-session leakage - Ensure one user's memories don't leak into another user's session. Use strict user_id isolation.
Sensitive data - Never store passwords, API keys, or personally identifiable information in memory unless encrypted.

Forget command: "Forget everything about {topic}"
If the user asks to forget, remove all related entities and facts from storage.
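A Python sketch of per-user isolation and a topic-scoped forget command. It assumes each stored fact is tagged with a topic when it is saved; the MemoryStore class and the sample users are hypothetical.

```python
class MemoryStore:
    """In-memory store keyed by user_id so one user's facts never reach another."""

    def __init__(self) -> None:
        self._by_user = {}  # user_id -> list of (topic, fact)

    def remember(self, user_id: str, topic: str, fact: str) -> None:
        self._by_user.setdefault(user_id, []).append((topic, fact))

    def recall(self, user_id: str) -> list:
        # Only the requesting user's own entries are ever returned.
        return [fact for _, fact in self._by_user.get(user_id, [])]

    def forget(self, user_id: str, topic: str) -> int:
        """Delete every fact on `topic` for this user; return how many were removed."""
        facts = self._by_user.get(user_id, [])
        kept = [(t, f) for t, f in facts if t != topic]
        self._by_user[user_id] = kept
        return len(facts) - len(kept)


store = MemoryStore()
store.remember("user_1", "framework", "Prefers Next.js")
store.remember("user_1", "notifications", "Prefers email confirmations")
store.remember("user_2", "framework", "Prefers Django")

removed = store.forget("user_1", "framework")
print(removed, store.recall("user_1"), store.recall("user_2"))
```

With a real database the same idea applies: every read and delete is filtered by user_id, and a forget request also has to reach backups and any derived summaries.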

Memory Budget Planning

Choose your strategy based on available context window:

Memory Approach	Token Cost (when incurred)	Setup Complexity	Cross-Session
Sliding window	~150-200 tokens (per message)	None	No
Summarization	~200-500 tokens (periodic)	Low	No
RAG retrieval	~500-1500 tokens (per retrieval)	Medium	Yes
Episodic (summary injection)	~300-800 tokens (session start)	Medium	Yes
Knowledge graph	~200-500 tokens (query results)	High	Yes
File-based JSON	~500-2000 tokens (full load)	Low	Yes

For an 8K model, stick to sliding window + optional summarization. For 128K+ models, you can afford RAG plus episodic memory.

Hybrid Architecture Example

A customer support agent using all 4 memory types:

Memory layers (checked in order):

1. Context window (last 10 messages) — recent conversation state
   → "User just said the payment failed"

2. Episodic memory (past sessions)
   → "User had the same issue 3 days ago. Resolution was card expiration."

3. RAG (knowledge base)
   → "Payment failure docs: check card, check balance, check network"

4. Persistent memory (user profile)
   → "User prefers email confirmations, not SMS"

Decision flow:
- Use context window for immediate replies
- Check episodic if user says "this happened before"
- Query RAG for technical answers
- Reference persistent memory for personalization
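The decision flow can be sketched as a small router in Python. The phrase matching and the is_technical flag are crude placeholders for whatever intent detection the agent actually uses; the function name and layer labels are illustrative.

```python
RECURRENCE_HINTS = ("happened before", "again", "last time")


def select_layers(message: str, is_technical: bool) -> list[str]:
    """Pick which memory layers to consult for one incoming message."""
    layers = ["context_window"]  # always used for immediate replies
    text = message.lower()
    if any(hint in text for hint in RECURRENCE_HINTS):
        layers.append("episodic")  # the user refers to a past session
    if is_technical:
        layers.append("rag")  # look up knowledge base articles
    layers.append("persistent")  # profile data for personalization
    return layers


print(select_layers("My payment failed, and this happened before", is_technical=True))
print(select_layers("Thanks!", is_technical=False))
```

Consulting layers conditionally keeps the episodic and RAG costs off turns that do not need them.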

Best Practices

Layer your memory - Use context window for recent, RAG for facts, persistent storage for long-term.
Be selective - More memory isn't always better; noise hurts quality.
Update proactively - Refresh stored information when it becomes stale.
Respect privacy - Give users control over what's stored and for how long.
Handle misses - When memory is empty or irrelevant, fall back to general knowledge.
Budget proactively - Calculate token costs before choosing a strategy, not after.
