This document synthesizes research across 10 major agentic memory architectures, providing implementation details, performance benchmarks, and comparative analysis for production deployment.
- A-MEM (Zettelkasten-based)
- Reflexion (Verbal RL)
- MemGPT (OS-inspired)
- DSPy (Declarative Pipelines)
- Mem0 (Hybrid Memory Layer)
- AWM (Workflow Memory)
- StateFlow (FSM Control)
- ADAS/Meta Agent Search
- Generative Agents (Simulacra)
- MemoryBank (Ebbinghaus-inspired)
Paper: Xu et al., arXiv 2502.12110 (Feb 2025) Repository: https://github.com/agiresearch/A-mem
Dynamic memory organization inspired by the Zettelkasten method - creating interconnected knowledge networks through atomic notes and flexible linking.
┌─────────────────────────────────────────────────────────────────┐
│ A-MEM System │
├─────────────────────────────────────────────────────────────────┤
│ │
│ New Memory ──▶ Note Construction ──▶ Link Generation ──▶ Store │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Attributes │ │ Analyze │ │
│ │ - content │ │ Historical │ │
│ │ - keywords │ │ Memories │ │
│ │ - tags │ └─────────────┘ │
│ │ - context │ │ │
│ │ - embedding│ ▼ │
│ └─────────────┘ Memory Evolution │
│ (Update existing) │
└─────────────────────────────────────────────────────────────────┘
class MemoryNote:
id: str # Unique identifier
content: str # Raw memory content
keywords: List[str] # Extracted keywords
tags: List[str] # Categorical tags
context: str # LLM-generated contextual description
embedding: np.array # all-MiniLM-L6-v2 vector
links: List[str] # Connected memory IDs
created_at: datetime
    updated_at: datetime

def generate_links(new_memory, historical_memories, threshold=0.7):
"""Two-stage link generation: semantic + LLM decision"""
# Stage 1: Cosine similarity filtering
candidates = []
for mem in historical_memories:
sim = cosine_similarity(new_memory.embedding, mem.embedding)
if sim > threshold:
candidates.append((mem, sim))
# Stage 2: LLM decision for meaningful connections
links = []
for candidate, sim in candidates:
prompt = f"""Determine if these memories should be linked:
Memory 1: {new_memory.content}
Memory 2: {candidate.content}
Should link? (yes/no):"""
if llm_call(prompt).strip().lower() == "yes":
links.append(candidate.id)
    return links

When new memories are added, existing memories can be updated (a minimal sketch follows this list):
- Context descriptions refined based on new connections
- Tags/keywords expanded from new insights
- Link weights adjusted based on access patterns
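A minimal sketch of that evolution step, written in the spirit of the paper rather than copied from the repository; `llm_call` is a hypothetical helper and the fields follow the `MemoryNote` schema above:

```python
import json
from datetime import datetime

def evolve_memory(neighbor, new_memory, llm_call):
    """Let a newly linked note refine an existing note's context and tags."""
    prompt = f"""Existing note context: {neighbor.context}
Existing note tags: {neighbor.tags}
Newly linked note: {new_memory.content}

Return JSON with updated "context" and "tags" for the existing note,
incorporating any new insight (return them unchanged if nothing changes)."""
    update = json.loads(llm_call(prompt))
    neighbor.context = update["context"]
    neighbor.tags = sorted(set(neighbor.tags) | set(update["tags"]))
    neighbor.updated_at = datetime.now()
```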
| Benchmark | A-MEM | MemGPT | Improvement |
|---|---|---|---|
| Single Hop | 0.85 | 0.72 | +18% |
| Multi Hop | 0.78 | 0.39 | +100% |
| Token Usage | 1,200-2,500 | 16,900 | 85-93% reduction |
Ablation shows Link Generation (LG) + Memory Evolution (ME) provide 2.8x improvement on Single Hop tasks.
Paper: Shinn et al., NeurIPS 2023 (arXiv 2303.11366)
Self-reflection through verbal reinforcement - the agent generates natural language feedback about its failures and uses this to improve subsequent attempts.
┌─────────────────────────────────────────────────────────────────┐
│ Reflexion Loop │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌───────────┐ ┌──────────────┐ │
│ │ Actor │───▶│ Evaluator │───▶│ Self-Reflect │ │
│ └─────────┘ └───────────┘ └──────────────┘ │
│ ▲ │ │
│ │ ▼ │
│ │ ┌──────────────┐ │
│ └──────────────────────────│ Memory │ │
│ │ (Reflections)│ │
│ └──────────────┘ │
└─────────────────────────────────────────────────────────────────┘
# Call 1: Error Identification
error_prompt = f"""
Previous attempt failed with error:
{error_message}
Code that failed:
{failed_code}
Identify what went wrong and why.
"""
error_analysis = llm_call(error_prompt)
# Call 2: Implementation Correction
fix_prompt = f"""
Based on this analysis:
{error_analysis}
Previous reflections:
{memory.get_reflections()}
Generate corrected implementation:
"""
corrected_code = llm_call(fix_prompt)
- Sliding window: only the ~3 most recent reflections are retained (see the sketch after this list)
- Natural language format: Human-readable failure analysis
- Accumulative: Reflections build on prior insights
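A minimal sketch of such a buffer, compatible with the `memory.get_reflections()` call used in the fix prompt above (names are illustrative, not the reference implementation):

```python
from collections import deque

class ReflectionMemory:
    """Sliding-window store of verbal reflections."""
    def __init__(self, window: int = 3):
        self.reflections = deque(maxlen=window)  # keep only the most recent attempts

    def add(self, reflection: str) -> None:
        self.reflections.append(reflection)

    def get_reflections(self) -> str:
        # Injected verbatim into the next attempt's prompt
        return "\n\n".join(self.reflections)
```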
| Task | Reflexion | Base GPT-4 |
|---|---|---|
| HumanEval | 88% pass@1 | 67% |
| ALFWorld | 97% | 75% |
| HotPotQA | 77% | 62% |
Paper: Packer et al., arXiv 2310.08560 (Oct 2023) Framework: Letta (https://github.com/letta-ai/letta)
Operating system metaphor - LLM manages its own memory through function calls, paging information between tiers as needed.
┌─────────────────────────────────────────────────────────────────┐
│ MemGPT Memory Hierarchy │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ CORE MEMORY ││
│ │ ┌──────────────────┐ ┌──────────────────┐ ││
│ │ │ Persona Block │ │ User Block │ ││
│ │ │ (Agent identity)│ │ (User profile) │ ││
│ │ └──────────────────┘ └──────────────────┘ ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ RECALL MEMORY ││
│ │ (Conversation history buffer) ││
│ └─────────────────────────────────────────────────────────────┘│
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────┐│
│ │ ARCHIVAL MEMORY ││
│ │ (Unlimited external storage) ││
│ └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
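How the tiers translate into a single prompt is easiest to see in code. Below is a hedged sketch of context assembly; the budgeting heuristic and names are illustrative, not Letta's implementation:

```python
def build_context(persona_block: str, user_block: str,
                  recall_buffer: list[str], token_budget: int = 8000) -> str:
    """Assemble the context window from core + recall memory (illustrative)."""
    core = (f"<persona>\n{persona_block}\n</persona>\n"
            f"<user>\n{user_block}\n</user>")
    # Recall memory: newest messages that still fit the budget; anything older
    # must be paged back in explicitly via conversation/archival search.
    kept, used = [], len(core) // 4            # crude 4-chars-per-token estimate
    for msg in reversed(recall_buffer):
        used += len(msg) // 4
        if used > token_budget:
            break
        kept.insert(0, msg)
    return core + "\n" + "\n".join(kept)
```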
# Core memory operations
def core_memory_append(label: str, content: str):
"""Add to persona or user block"""
def core_memory_replace(label: str, old: str, new: str):
"""Edit existing core memory"""
# Archival operations
def archival_memory_insert(content: str):
"""Store in long-term archival"""
def archival_memory_search(query: str, page: int = 0):
"""Retrieve from archival with pagination"""
# Conversation operations
def conversation_search(query: str, page: int = 0):
"""Search conversation history"""# Agent can request additional processing steps
response = agent.step(user_message)
if response.request_heartbeat:
# Continue processing without user input
    response = agent.step(None)  # Internal continuation

Paper: Khattab et al., arXiv 2310.03714 (Oct 2023) Repository: https://github.com/stanfordnlp/dspy
Programming model that abstracts LM pipelines as text transformation graphs. Treats prompting as an optimization problem rather than manual engineering.
# 1. SIGNATURES - Declarative I/O
"question -> answer"
class RAG(dspy.Signature):
"""Answer questions with retrieved context."""
context = dspy.InputField()
question = dspy.InputField()
answer = dspy.OutputField()
# 2. MODULES - Parameterized components
class MyRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought(RAG)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)
# 3. TELEPROMPTERS - Optimization algorithms
optimizer = dspy.MIPROv2(metric=accuracy)
optimized_rag = optimizer.compile(
    MyRAG(),
    trainset=train_data,
)

- Grounded Proposal: Generate candidate instructions/demonstrations
- Discrete Search: Explore combinations
- Surrogate Model: Learn to predict quality
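To quantify what compilation buys, the compiled and uncompiled programs can be scored with the same metric and the optimized prompts persisted. A sketch assuming a recent DSPy release, reusing `MyRAG`, `optimized_rag`, and `accuracy` from above, plus a hypothetical `dev_data` split:

```python
import dspy
from dspy.evaluate import Evaluate

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # assumed model; any supported LM works

evaluate = Evaluate(devset=dev_data, metric=accuracy, display_progress=True)
baseline_score = evaluate(MyRAG())        # un-optimized program
compiled_score = evaluate(optimized_rag)  # after MIPROv2 compilation

optimized_rag.save("rag_optimized.json")  # persist learned instructions/demos
```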
| Model | Before | After | Task |
|---|---|---|---|
| GPT-3.5 | 33% | 82% | Case Study 1 |
| GPT-3.5 | 32% | 46% | Case Study 2 |
| Llama2-13b | 9% | 47% | Case Study 1 |
| T5-770M | - | Competitive with GPT-3.5 | General |
Compilation time: Minutes to tens of minutes
Paper: Chhikara et al., arXiv 2504.19413 (April 2025) Repository: https://github.com/mem0ai/mem0
Hybrid data store combining vector database + graph database + key-value store for comprehensive memory management.
┌─────────────────────────────────────────────────────────────────┐
│ Mem0 Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐ │
│ │ Vector Store │ │ Graph Store │ │ KV Store │ │
│ │ (Semantic) │ │ (Relations) │ │ (Facts) │ │
│ └───────────────┘ └───────────────┘ └───────────────┘ │
│ │ │ │ │
│ └─────────────────┴──────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ LLM Processor │ │
│ │ (CRUD + Update) │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Operation | Description |
|---|---|
| ADD | Insert new fact |
| UPDATE | Modify existing memory |
| DELETE | Remove obsolete |
| NO-OP | Skip duplicate |
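The routing between these operations is an LLM decision over the new fact and its nearest existing memories. The snippet below is an illustrative sketch of that decision, not Mem0's actual implementation; `vector_store` and `llm_call` are hypothetical helpers:

```python
import json

def resolve_memory_op(new_fact: str, vector_store, llm_call, top_k: int = 5) -> dict:
    """Choose ADD / UPDATE / DELETE / NO-OP for a newly extracted fact."""
    similar = vector_store.search(new_fact, k=top_k)  # nearest existing memories
    prompt = f"""Existing memories: {json.dumps([m.text for m in similar])}
New fact: {new_fact}

Return JSON: {{"op": "ADD|UPDATE|DELETE|NOOP", "target_id": <id or null>, "text": "<final text>"}}"""
    decision = json.loads(llm_call(prompt))
    if decision["op"] == "ADD":
        vector_store.insert(decision["text"])
    elif decision["op"] == "UPDATE":
        vector_store.update(decision["target_id"], decision["text"])
    elif decision["op"] == "DELETE":
        vector_store.delete(decision["target_id"])
    return decision  # NO-OP: duplicate information, nothing is stored
```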
# Extraction Phase
entities = EntityExtractor(message) # → Nodes
relations = RelationsGenerator(entity_pairs) # → Labeled edges
# Edge labels: 'lives_in', 'prefers', 'owns', 'happened_on'
# Update Phase
conflicts = ConflictDetector(new_triples, existing_graph)
updates = UpdateResolver(conflicts) # → {add, merge, invalidate, skip}
# Marks invalid rather than deleting (temporal reasoning)
# Dual Retrieval
# 1. Entity-centric: entity → similarity → traverse → subgraph
# 2. Semantic triplet: query embedding → match triplet embeddings

import os
from mem0 import Memory
config = {
"graph_store": {
"provider": "neo4j",
"config": {
"url": os.environ["NEO4J_URL"],
"username": "neo4j",
"password": os.environ["NEO4J_PASSWORD"]
}
}
}
memory = Memory.from_config(config)
memory.add("Alice met Bob at GraphConf 2025", user_id="demo-user")
results = memory.search("Who did Alice meet?", user_id="demo-user")

| Scope | Persistence | Use Case |
|---|---|---|
| `user_id` | Across all conversations | User preferences |
| `session_id` | Single conversation | Current context |
| `agent_id` | Per agent instance | Agent-specific knowledge |
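A short usage sketch of scoping with the open-source Python SDK, reusing the `memory` instance configured above; only the `user_id` and `agent_id` keys are shown:

```python
# User-level preference: persists across all of this user's conversations
memory.add("Alice prefers vegetarian restaurants", user_id="demo-user")

# Agent-level knowledge: tied to one agent instance, independent of any user
memory.add("Always answer in metric units", agent_id="travel-planner")

# Retrieval is filtered by the same scope keys
prefs = memory.search("What food does Alice like?", user_id="demo-user")
```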
- 26% higher accuracy vs OpenAI memory
- 91% lower p95 latency vs full-context
- 90% token savings
- Mem0ᵍ: ~2% higher overall score than base Mem0
Paper: Wang et al., arXiv 2409.07429 (Sep 2024) Repository: https://github.com/zorazrw/agent-workflow-memory
Induces reusable workflows from agent trajectories, stores them in memory, and guides future task-solving through hierarchical composition.
I(E_train) → W_offline # Offline: induce from training examples
L(q, M+W, o_test) → a_test # Utilize workflows at inference
| Mode | Description | Best For |
|---|---|---|
| Offline | Pre-induce workflows from training set | Known task distributions |
| Online | Streaming induce→integrate→utilize | Novel task discovery |
class Workflow:
name: str # "find_place_by_name"
description: str # Natural language
steps: List[Action] # Primitive action sequence
# Hierarchical composition:
# Level 1 (Primitives): "click", "type"
# Level 2 (Induced): "find_place_by_name"
# Level 3 (Composite): "get_place_zipcode" (uses Level 2)

Query 1 → Solve → Induce W₁ → Memory
Query 2 → Solve (with W₁) → Induce W₂ (builds on W₁) → Memory
...
# Simple workflows become building blocks for complex ones
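A hedged sketch of that online loop; `solve` and `induce_workflows` are hypothetical stand-ins for the acting agent and the LM-based induction module:

```python
def online_awm(queries, solve, induce_workflows):
    """Streaming induce → integrate → utilize loop (illustrative)."""
    memory = []                                   # accumulated Workflow objects
    trajectories = []
    for query in queries:
        trajectory = solve(query, workflows=memory)           # utilize current memory
        new_workflows = induce_workflows([trajectory], existing=memory)
        memory.extend(new_workflows)              # integrate: later inductions can reuse earlier ones
        trajectories.append(trajectory)
    return trajectories, memory
```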
| Benchmark | AWM vs Baseline | Improvement |
|---|---|---|
| Mind2Web | +24.6% | Relative success rate |
| WebArena | +51.1% | Relative success rate |
| Steps | 7.9 → 5.9 | Average steps reduced |
Paper: Wu et al., arXiv 2403.11322 (Mar 2024) Integration: AutoGen GroupChat
Models LLM workflows as Finite State Machines, distinguishing process grounding (states/transitions) from sub-task solving (actions within states).
┌─────────────────────────────────────────────────────────────────┐
│ StateFlow FSM Model │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌───────┐ success ┌─────────┐ success ┌─────────┐ │
│ │ Init │──────────────▶│ Observe │─────────────▶│ Solve │ │
│ └───────┘ └─────────┘ └─────────┘ │
│ │ │ │
│ │ error │ error │
│ ▼ ▼ │
│ ┌─────────┐ ┌─────────┐ │
│ │ Error │◀────────────│ Verify │ │
│ └─────────┘ └─────────┘ │
│ │ │
│ │success │
│ ▼ │
│ ┌─────────┐ │
│ │ End │ │
│ └─────────┘ │
└─────────────────────────────────────────────────────────────────┘
| Variant | Description | Context Management |
|---|---|---|
| StateFlow | Single LLM, different instructions per state | Instructions in context |
| SF_Agent | Different LLM agents per state | No context bloat |
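For the single-LLM StateFlow variant, each state supplies its own instruction and the transition function inspects the reply. Below is a hedged, framework-free sketch following the FSM above (`llm_call` is a hypothetical helper); the AutoGen GroupChat example that follows corresponds to the SF_Agent style:

```python
STATE_PROMPTS = {
    "observe": "Gather the information needed for the task and summarize it.",
    "solve":   "Using the observations so far, produce a candidate solution.",
    "verify":  "Check the candidate solution. Reply PASS or FAIL with a reason.",
}

def run_stateflow(task: str, llm_call, max_steps: int = 10) -> str:
    state, history = "observe", [f"Task: {task}"]
    for _ in range(max_steps):
        reply = llm_call(STATE_PROMPTS[state] + "\n\n" + "\n".join(history))
        history.append(f"[{state}] {reply}")
        if state == "observe":
            state = "solve"
        elif state == "solve":
            state = "verify"
        elif state == "verify":
            if "PASS" in reply:
                return reply       # End state reached
            state = "solve"        # error path: go back and re-solve
    return history[-1]
```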
import autogen
def state_transition(last_speaker, groupchat):
messages = groupchat.messages
if last_speaker is initializer:
return coder # Init → Retrieve
elif last_speaker is coder:
return executor # Retrieve action
elif last_speaker is executor:
if "exitcode: 1" in messages[-1]["content"]:
return coder # Error → Retry
else:
return scientist # Success → Research
    elif last_speaker is scientist:
return None # Research → End
groupchat = autogen.GroupChat(
agents=[initializer, coder, executor, scientist],
messages=[],
max_round=20,
speaker_selection_method=state_transition,
)

| Benchmark | vs ReAct | Cost Reduction |
|---|---|---|
| InterCode SQL | +13% | 5x less |
| ALFWorld | +28% | 3x less |
Key Insight: Combines with Reflexion for further improvement.
Paper: Hu et al., ICLR 2025 (arXiv 2408.08435) Repository: https://github.com/ShengranHu/ADAS
Meta Agent Search - a meta agent iteratively programs new agents in code, evaluates them, and builds an archive of discoveries to inform future iterations.
┌─────────────────────────────────────────────────────────────────┐
│ ADAS Components │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. SEARCH SPACE: Code (Turing Complete) │
│ → Can represent ANY agentic system │
│ │
│ 2. SEARCH ALGORITHM: Meta Agent Search │
│ Archive → Meta Agent → New Code → Evaluate → Archive │
│ │
│ 3. EVALUATION: Task accuracy on benchmarks │
│ │
└─────────────────────────────────────────────────────────────────┘
def meta_agent_search(iterations=25):
# Seed with baseline agents
archive = [CoT, CoT_SC, SelfRefine, LLM_Debate]
for i in range(iterations):
# Meta agent (GPT-4) generates new agent code
agent_code = meta_agent.generate(
archive=archive,
instruction="Create novel, interesting agent"
)
# Evaluate on benchmark (using GPT-3.5)
performance = evaluate(agent_code, benchmark)
# Add if novel and performant
if is_novel(agent_code, archive) and performance > threshold:
archive.add(agent_code, performance)
    return best_agent(archive)

| Domain | Improvement |
|---|---|
| DROP (F1) | +13.6/100 |
| MGSM (accuracy) | +14.4% |
| GSM8K (transfer) | +25.9% |
| GSM-Hard (transfer) | +13.2% |
Agents discovered in math domain transfer to:
- Reading comprehension
- Science questions
- Multi-task problems
This suggests ADAS discovers general design patterns, not task-specific tricks.
Paper: Park et al., UIST 2023 (arXiv 2304.03442) Repository: https://github.com/joonspk-research/generative_agents
Computational agents that simulate believable human behavior through memory streams, reflection, and planning - demonstrated in a Sims-like sandbox environment.
┌─────────────────────────────────────────────────────────────────┐
│ Generative Agent Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Perception ──▶ Memory Stream ──▶ Retrieval ──▶ Action │
│ │ ▲ │
│ ▼ │ │
│ Reflection ─────────┘ │
│ │ │
│ ▼ │
│ Planning │
│ │
└─────────────────────────────────────────────────────────────────┘
Complete record of agent experiences in natural language:
class MemoryObject:
description: str # Natural language observation
created_at: datetime
last_accessed: datetime
    importance: float       # 1-10 scale (LLM-assigned)

def retrieve(agent, query, k=10):
"""Retrieve memories based on recency, importance, relevance"""
for memory in agent.memory_stream:
# Recency: exponential decay
recency = exp(-decay * hours_since_access(memory))
# Importance: LLM-assigned score
importance = memory.importance / 10
# Relevance: embedding similarity
relevance = cosine_sim(embed(query), embed(memory.description))
# Combined score (equal weighting)
memory.score = (recency + importance + relevance) / 3
    return sorted(agent.memory_stream, key=lambda m: m.score, reverse=True)[:k]

Triggered when the sum of importance scores of recent memories exceeds a threshold (~150):
def reflect(agent):
# 1. Get recent memories
recent = agent.memory_stream[-100:]
# 2. Generate salient questions
questions = llm_call(f"""
Given these observations:
{recent}
What are 3 most salient high-level questions?
""")
# 3. Retrieve relevant memories per question
for question in questions:
relevant = retrieve(agent, question)
# 4. Generate insights
insight = llm_call(f"""
Statements: {relevant}
Question: {question}
What 5 high-level insights can you infer?
""")
# 5. Store reflection as new memory
agent.memory_stream.add(MemoryObject(
description=insight,
importance=8, # Reflections are important
is_reflection=True
        ))

From a single seed ("Isabella wants to throw a Valentine's Day party"):
- Agents autonomously spread invitations over 2 days
- Made new acquaintances
- Asked each other on dates
- Coordinated to arrive together at correct time
| Condition | Believability Score |
|---|---|
| Full architecture | Best |
| No reflection | Significant drop |
| No planning | Significant drop |
| No observation | Worst |
Paper: Zhong et al., AAAI 2024 (arXiv 2305.10250) Repository: https://github.com/zhongwanjun/MemoryBank-SiliconFriend
Human-like memory mechanism inspired by the Ebbinghaus Forgetting Curve - memories decay over time but are reinforced through access and importance.
┌─────────────────────────────────────────────────────────────────┐
│ MemoryBank Architecture │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Conversation ──▶ Event Extraction ──▶ Memory Storage │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Ebbinghaus Decay│ │
│ │ + Importance │ │
│ └─────────────────┘ │
│ │ │
│ ▼ │
│ Response ◀── Memory Retrieval ◀── User Portrait │
│ │
└─────────────────────────────────────────────────────────────────┘
from math import exp, log

def memory_strength(memory, current_time):
"""Ebbinghaus-inspired decay with importance weighting"""
# Base retention from forgetting curve
time_elapsed = current_time - memory.last_accessed
retention = exp(-time_elapsed / memory.stability)
# Importance factor
importance_weight = memory.importance / 10
# Reinforcement from access count
reinforcement = log(1 + memory.access_count)
    return retention * importance_weight * reinforcement

| Component | Description |
|---|---|
| Event Summaries | Extracted key events from conversations |
| User Portrait | Synthesized personality understanding |
| Memory Index | Encoded representations for retrieval |
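A hedged sketch of how these pieces might be grouped per user (field names are illustrative, not the repository's schema); `importance`, `access_count`, and `last_accessed` feed the decay formula above:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class MemoryRecord:
    text: str                      # event summary extracted from a conversation
    embedding: List[float]         # entry in the memory index used for retrieval
    importance: float = 5.0        # 1-10, LLM- or rule-assigned
    access_count: int = 0
    last_accessed: datetime = field(default_factory=datetime.now)

@dataclass
class UserMemoryBank:
    user_portrait: str = ""                        # synthesized personality summary
    events: List[MemoryRecord] = field(default_factory=list)
```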
AI companion chatbot tuned on 38K psychological dialogs:
- Empathetic responses
- Personality understanding
- Long-term relationship building
| Metric | SiliconFriend | Baseline |
|---|---|---|
| Memory Retrieval Accuracy | High | - |
| Response Correctness | Improved | - |
| Empathy Score | Significantly higher | - |
Works with both:
- Closed-source: ChatGPT
- Open-source: ChatGLM, BELLE
| Method | Memory Type | Organization | Retrieval | Evolution |
|---|---|---|---|---|
| A-MEM | Knowledge graph | Dynamic linking | Semantic + graph | LLM-driven updates |
| Reflexion | Episodic | Verbal summaries | Context injection | Accumulates |
| MemGPT | Tiered (OS-style) | Main/External | Function calls | Self-editing |
| DSPy | Compiled traces | Teleprompter opt | N/A (compiled) | Optimization |
| Mem0 | Hybrid (V+G+KV) | Entity relations | Dual retrieval | LLM CRUD |
| AWM | Workflow sequences | Hierarchical | Rule/LM-based | Snowball |
| StateFlow | Context history | FSM states | State-based | N/A |
| ADAS | Agent code | Archive | Meta-level | Iterative |
| Generative Agents | Memory stream | Time-indexed | Recency+Importance+Relevance | Reflection |
| MemoryBank | Episodic + Portrait | Ebbinghaus decay | Importance-weighted | Forgetting curve |
| Method | Typical Context Usage | Efficiency |
|---|---|---|
| A-MEM | 1,200-2,500 tokens | Best |
| Mem0 | ~90% token savings | Excellent |
| StateFlow | 3-5x context reduction | Excellent |
| MemoryBank | Variable | Good |
| MemGPT | ~16,900 tokens | Moderate |
| Reflexion | Variable | Moderate |
| Method | Complexity | Dependencies | Best Starting Point |
|---|---|---|---|
| Reflexion | Low | LLM only | Immediate |
| MemoryBank | Low | LLM + storage | Immediate |
| Mem0 | Low-Medium | Neo4j + vector DB | Production apps |
| DSPy | Medium | DSPy library | Pipeline optimization |
| A-MEM | Medium | ChromaDB + LLM | Knowledge-intensive |
| StateFlow | Medium | AutoGen | Sequential tasks |
| MemGPT | Medium | Letta framework | Long conversations |
| AWM | Medium | Custom impl | Web automation |
| Generative Agents | High | Custom sandbox | Simulation research |
| ADAS | High | Meta-agent infra | Agent discovery research |
| Need | Recommended | Why |
|---|---|---|
| Production user memory | Mem0 | 26% accuracy boost, production-ready |
| Complex reasoning chains | A-MEM | 2x better multi-hop performance |
| Web automation | AWM | 51% improvement on WebArena |
| Sequential task control | StateFlow | 5x cost reduction |
| Pipeline optimization | DSPy | 33%→82% quality improvement |
| Long-term companionship | MemoryBank | Ebbinghaus-based, empathetic |
| Social simulation | Generative Agents | Emergent behaviors |
| Learning from failures | Reflexion | Simple, effective |
| Unbounded context | MemGPT | OS-inspired virtual memory |
| Discovering new architectures | ADAS | Meta-agent search |
- Start with Reflexion - Simplest to implement, immediate benefits
- Add Mem0 - Production-ready persistent memory
- Integrate StateFlow - When control flow matters
- Consider A-MEM - For knowledge-intensive applications
- Explore ADAS - For cutting-edge agent discovery
| Paper ID | Title | Year |
|---|---|---|
| 2502.12110 | A-MEM: Agentic Memory for LLM Agents | 2025 |
| 2303.11366 | Reflexion: Language Agents with Verbal Reinforcement Learning | 2023 |
| 2310.08560 | MemGPT: Towards LLMs as Operating Systems | 2023 |
| 2310.03714 | DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines | 2023 |
| 2504.19413 | Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory | 2025 |
| 2409.07429 | Agent Workflow Memory | 2024 |
| 2403.11322 | StateFlow: Enhancing LLM Task-Solving through State-Driven Workflows | 2024 |
| 2408.08435 | Automated Design of Agentic Systems | 2024 |
| 2304.03442 | Generative Agents: Interactive Simulacra of Human Behavior | 2023 |
| 2305.10250 | MemoryBank: Enhancing Large Language Models with Long-Term Memory | 2023 |
Generated: January 2026