What agent memory actually is
Agent memory at the depth engineers need before betting a stack on it: the architectural differences between vector, graph, and OS-style approaches; how Letta / Mem0 / Zep / Graphiti / MCP-memory differ in practice; the retrieval flow that determines whether the agent remembers correctly or hallucinates with confidence; and the conditions under which memory genuinely helps versus the ones under which it actively hurts.
What agent memory actually means
Agent memory is persistent state about prior interactions that informs the agent's reasoning on subsequent calls. The defining property is persistence across the model's context-window boundary. A 128K-token context-window LLM is not “memory-enabled” just because it can hold a long conversation in its window. Memory is what survives when the conversation ends and a new one starts hours, days, or weeks later.
Most discussions of agent memory conflate three layers that should be treated separately. The most common definitional mistake — including in vendor docs from major frameworks — is to call any of these three “memory” without qualifying which:
- Short-term memory — content held inside the LLM's active context window during a single inference call. Sometimes called “working memory.” Lives in VRAM as the KV cache; vanishes when the request returns.
- Conversation memory — chat history within a session. Usually held in a frontend's database (Open WebUI's SQLite, AnythingLLM's LanceDB, etc.) and replayed into the model's context on each new turn.
- Long-term memory — state that persists across sessions, gets compacted and consolidated, and is retrieved selectively rather than replayed wholesale. This is what tools like Letta, Mem0, Zep, and Graphiti are actually about.
For the rest of this page, “agent memory” means long-term memory unless explicitly qualified. The first two layers are well-understood and aren't the place where architectural decisions matter. The third layer is where the ecosystem currently lacks operational clarity.
What it does not mean
Three patterns get called “agent memory” in marketing copy and aren't:
- Long context windows — Claude 3.5's 200K window or Gemini 2.5's 2M window. These are working-memory extensions, not long-term memory. They vanish at the session boundary.
- Vector RAG over documents — what AnythingLLM does. RAG is *retrieval over a corpus you ingested*; memory is *retrieval over things the agent itself observed*. RAG answers “what does the doc say”; memory answers “what did we decide last Tuesday.”
- Fine-tuning — burning knowledge into the weights themselves. Different operating point: high persistence, low updateability. Memory frameworks add persistence without retraining.
Architecture: short-term vs long-term, vector vs graph, OS-style
The architectural decisions that actually matter when picking a memory framework. Three axes:
Storage shape — vector vs graph. Vector memory stores observations as embeddings in a vector store ([LanceDB](/tools/lancedb), [Qdrant](/tools/qdrant), Chroma). Retrieval is similarity search: given a query, find the K most similar past observations. Graph memory stores observations as nodes and relationships in a graph DB (Neo4j, in-memory variants), with explicit semantic structure: “the user decided X because of Y at time T.” Retrieval is graph traversal: given a query, walk N hops to find related context. Vector wins on simplicity and speed; graph wins on multi-hop reasoning and explicit causality.
Consolidation strategy — implicit vs explicit. Implicit consolidation (Mem0's default) runs an LLM pass over recent episodes at session boundaries and writes consolidated insights back to memory. Explicit consolidation (Letta's default) treats memory as an OS-style address space the agent itself manages — the agent decides when to archive, when to compress, when to evict. Implicit is faster to wire but harder to audit when memory drifts; explicit is more predictable but requires the agent to be smart about its own memory state.
Storage location — local vs hosted. Local memory lives in the agent's own database (LanceDB on disk, Postgres for structured facts, JSON for graph). Hosted memory lives in a managed service (Zep's cloud, Mem0's cloud tier). Local wins on privacy and operational simplicity; hosted wins on cross-machine continuity and managed maintenance.
The three axes combine into a 2×2×2 = 8 possible architectural shapes. The four that matter in practice:
Vector + implicit + local — the [Mem0](/tools/mem0) shape with [LanceDB](/tools/lancedb). Most popular; lowest friction. Drop-in API.
Vector + implicit + hosted — Mem0 cloud tier or [Zep](/tools/zep) basic tier. Drop-in API but with cross-machine continuity.
Graph + explicit + local — [Letta](/tools/letta). OS-style memory hierarchy with explicit paging. Strongest for long-horizon tasks where the agent needs to reason about its own memory state.
Graph + implicit + local — [Graphiti](/tools/graphiti) with Neo4j. Multi-hop temporal reasoning over consolidated memory; the right pick when “what did Bob decide about authentication three sessions ago and why” is the shape of question you ask.
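The most popular of these shapes — vector + implicit + local — reduces, at its core, to "embed observations, retrieve by cosine similarity." A minimal self-contained sketch; the embedder here is a toy per-word index standing in for a real embedding model, and `VectorMemory` is this sketch's own class, not Mem0's API:

```python
import math

VOCAB: dict[str, int] = {}  # stable word -> dimension index; stands in for a real embedder

def embed(text: str, dim: int = 128) -> list[float]:
    # Toy embedding: one dimension per distinct word, L2-normalized.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[VOCAB.setdefault(word, len(VOCAB)) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class VectorMemory:
    """Vector + local memory: append observations, retrieve by cosine similarity."""
    def __init__(self) -> None:
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, observation: str) -> None:
        self.entries.append((observation, embed(observation)))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries,
                        key=lambda e: -sum(a * b for a, b in zip(q, e[1])))
        return [text for text, _ in ranked[:k]]

mem = VectorMemory()
mem.add("auth-token expiry handled via refresh path")
mem.add("user prefers tap.test over describe/it")
mem.add("deploy runs on Fridays")
print(mem.search("auth-token expiry refresh path", k=1))
# ['auth-token expiry handled via refresh path']
```

Swap the toy embedder for a real model and the list for LanceDB or Qdrant and this is, structurally, the whole retrieval half of the shape; implicit consolidation is a separate LLM pass writing new `add()` calls.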
Workflow: how a query traverses the memory hierarchy
A concrete example — an agent receives a task and queries memory before reasoning:
- Task ingestion. “Apply the auth-test fix you proposed yesterday.” The agent doesn't know what was proposed yesterday yet — that's in long-term memory.
- Episodic retrieval (vector). The agent embeds the task description, runs vector similarity search over past episode summaries. Returns the 3-5 most similar past episodes. The Mem0 / Zep / Graphiti happy path.
- Semantic retrieval (vector or graph). “What general patterns about auth-test fixes have we consolidated?” Returns consolidated insights from prior episodes. This is where Mem0g / Zep's temporal graph beats flat vector retrieval — multi-hop synthesis over consolidated facts.
- Structured retrieval (SQL or graph). “What were the specific files / commits / test names from yesterday's episode?” This is where [MCP-postgres](/tools/mcp-server-postgres) or a graph DB beats vector retrieval — exact lookup, not similarity.
- Repo-state grounding. “Has the proposed fix actually been committed?” This is the critical cross-check via [MCP-git](/tools/mcp-server-git) — episodic memory says one thing, the actual repo state may say another. The agent should trust the repo when the two disagree.
- Reasoning + action. The agent synthesizes retrieved memory + repo state, plans, executes via tool calls.
- Episode write-back. At session end, the agent (or the framework) writes a summary of what happened back to long-term memory. Implicit consolidation frameworks compress episodes into semantic facts on a schedule; explicit consolidation requires the agent to decide what to keep.
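The steps above can be sketched as one pipeline. Every store here is a hypothetical stand-in (`EpisodicStore`, `StructuredStore`, and `Repo` are this sketch's own names, not any framework's API), but the ordering — episodic, then structured, then ground-truth cross-check, with the repo winning on conflict — is the point:

```python
class EpisodicStore:
    """Episodic retrieval: similarity search over past episode summaries (toy word overlap)."""
    def __init__(self, episodes: list[str]) -> None:
        self.episodes = episodes
    def search(self, query: str, k: int = 3) -> list[str]:
        qw = set(query.lower().split())
        return sorted(self.episodes,
                      key=lambda e: -len(qw & set(e.lower().split())))[:k]

class StructuredStore:
    """Structured retrieval: exact lookup of files / commits / test names, no similarity."""
    def __init__(self, facts: dict[str, dict]) -> None:
        self.facts = facts
    def lookup(self, key: str) -> dict:
        return self.facts.get(key, {})

class Repo:
    """Repo-state grounding: ground truth. When memory and repo disagree, the repo wins."""
    def __init__(self, commits: list[str]) -> None:
        self.commits = set(commits)
    def has_commit(self, sha: str) -> bool:
        return sha in self.commits

def gather_context(task: str, episodic: EpisodicStore,
                   structured: StructuredStore, repo: Repo) -> dict:
    episodes = episodic.search(task)                 # episodic retrieval
    facts = structured.lookup("auth-test fix")       # structured retrieval (key hardcoded for the sketch)
    sha = facts.get("proposed_commit")
    committed = bool(sha) and repo.has_commit(sha)   # repo-state grounding
    return {"episodes": episodes, "facts": facts, "already_committed": committed}

ctx = gather_context(
    "apply the auth-test fix you proposed yesterday",
    EpisodicStore(["proposed auth-test fix in commit abc123",
                   "discussed deploy schedule"]),
    StructuredStore({"auth-test fix": {"proposed_commit": "abc123",
                                       "file": "test/auth.test.js"}}),
    Repo(commits=[]),   # the fix was proposed but never committed
)
print(ctx["already_committed"])  # False: memory says "proposed", repo confirms it was never applied
```

The `already_committed` flag is the grounding step doing its job — episodic memory alone would have let the agent assume the fix existed.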
Retrieval flow: episodic / semantic / structured / repo-state
The architectural rule that matters most for getting agent memory right: different memory questions deserve different retrieval strategies. Treating all memory as one vector store is among the most common failure patterns in production agent deployments.
Episodic memory (“what happened”) → vector similarity. “Find similar past sessions” maps cleanly to “find similar embedding vectors.” Mem0 / Zep / Graphiti are all good at this.
Semantic memory (“what general patterns”) → either vector consolidation or graph traversal. Vector consolidation (Mem0g) compresses episodes into consolidated insight strings; graph traversal (Graphiti) walks the temporal knowledge graph. Graph wins when multi-hop reasoning is required; vector wins when consolidation prompts can capture the patterns.
Structured memory (“exact facts”) → SQL / graph DB. “What was the file path” or “list all decisions tagged auth” doesn't benefit from similarity search; you want exact lookup. MCP-postgres is the canonical local path; managed services (Zep's structured facts API) work too.
Repo-state grounding (“what is true now”) → MCP-git or live filesystem reads. Memory layers carry stale state; the repo carries current state. Always cross-check memory claims against ground truth before destructive actions.
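The routing rule is small enough to make explicit in code. A sketch — the question-kind labels are this page's four categories, assumed to be assigned upstream by the agent or a classifier:

```python
from enum import Enum

class Backend(Enum):
    VECTOR_SIMILARITY = "episodic: vector similarity search"
    CONSOLIDATED = "semantic: vector consolidation or graph traversal"
    EXACT_LOOKUP = "structured: SQL / graph DB exact lookup"
    GROUND_TRUTH = "repo-state: MCP-git / live filesystem read"

ROUTES = {
    "what_happened": Backend.VECTOR_SIMILARITY,
    "what_patterns": Backend.CONSOLIDATED,
    "exact_fact": Backend.EXACT_LOOKUP,
    "what_is_true_now": Backend.GROUND_TRUTH,
}

def route(question_kind: str) -> Backend:
    # Refuse to guess: an unroutable question should fail loudly,
    # not silently fall back to one vector store for everything.
    try:
        return ROUTES[question_kind]
    except KeyError:
        raise ValueError(f"no retrieval route for question kind: {question_kind}")

print(route("exact_fact").value)  # structured: SQL / graph DB exact lookup
```

The explicit `ValueError` is the design choice worth copying: a missing route surfacing as an error beats every question quietly becoming a similarity search.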
Memory compaction — and why it can make memory worse
Compaction is the consolidation step that turns episodic memory into semantic memory. An LLM pass reads N recent episodes and writes consolidated insights back to memory: “the agent always handles auth-token expiry via the refresh path,” “the codebase prefers tap.test over describe/it.”
The honest problem: compaction is also the place where memory systems hallucinate confidently. The LLM doing the consolidation can:
- Generalize too aggressively — three episodes of one specific bug become a confident claim that “the codebase has bug pattern X.”
- Miss negation — “we decided NOT to do X” gets consolidated as “we decided to do X.” This happens often enough to require explicit audit.
- Drift on technical accuracy — the consolidation prompt sees only summaries, not full transcripts, so technical details get garbled.
- Lose causal structure — “X because Y” becomes “X and Y” in the consolidated output.
The mitigation pattern that works: audit consolidated memory at least monthly; treat it as a cache, not a source of truth. When the agent asserts a consolidated fact, verify against recent episodes or ground truth before acting on it. Disable consolidation entirely if your agent is consistently being confidently wrong.
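Part of that audit can be mechanical. A heuristic sketch of the negation check — the marker list and the two-shared-terms threshold are this sketch's own choices, not a framework feature: flag any consolidated fact whose terms also appear in an episode containing a negation marker, then route flagged facts to a human or a verification pass.

```python
NEGATION_MARKERS = ("not", "n't", "never", "rejected", "decided against")

def flag_suspect_consolidations(facts: list[str], episodes: list[str]) -> list[str]:
    """Flag consolidated facts that may have dropped a negation.
    Crude on purpose: substring markers plus a shared-term threshold.
    Flagged facts go to review, not straight to the bin."""
    suspects = []
    for fact in facts:
        terms = set(fact.lower().split())
        for episode in episodes:
            ep = episode.lower()
            shared = terms & set(ep.split())
            if len(shared) >= 2 and any(marker in ep for marker in NEGATION_MARKERS):
                suspects.append(fact)
                break
    return suspects

facts = ["we use oauth device flow"]                       # consolidated claim
episodes = ["team decided not to use oauth device flow"]   # what actually happened
print(flag_suspect_consolidations(facts, episodes))        # ['we use oauth device flow']
```

A heuristic like this catches the blatant "NOT X became X" cases cheaply; the subtle ones still need the episode-level audit.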
Privacy implications
Agent memory carries the strongest data-residency implications of any agent-architecture decision. Three honest observations:
Memory is sticky. Once an agent has processed sensitive data and consolidated it into long-term memory, that data is in the memory store indefinitely. There's no straightforward “forget this” — you have to wipe and rebuild. Plan the threat model accordingly.
Hosted memory crosses trust boundaries. Zep cloud, Mem0 cloud — your agent state lives on a third-party service. For most use cases, fine. For regulated industries or sensitive workloads, deal-breaker. Pick local memory (Letta, Mem0 OSS, Graphiti) when this matters.
Compaction is a leak surface. Consolidation prompts run on the LLM provider. If you're using a cloud API for the consolidation pass, episodes leave your network via that API call — even if the consolidated state stays local. For true air-gapped operation, run consolidation on a local model.
Latency math
Memory retrieval is on the critical path of every agent task. The latency budget matters. Numbers to plan around:
- Vector retrieval (LanceDB, <100K vectors): ~30-100ms per query.
- Vector retrieval (Qdrant, 100K-1M vectors): ~50-200ms per query.
- Graph retrieval (Neo4j 2-hop): ~50-300ms depending on graph size.
- Letta paged-memory retrieval: ~100-300ms. Higher than vector because of the explicit paging logic.
- MCP-postgres exact lookup: ~10-50ms for indexed queries; ~100-500ms for table scans.
- Consolidation pass: 1-10 seconds depending on episode count and model size. Run async at session boundary; never synchronously during a task.
A typical memory-enabled agent makes 5-15 retrieval calls per task. At 100ms per retrieval, that's 500-1500ms of memory latency before the LLM does any thinking. This adds up on long agent loops.
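The arithmetic above is worth making explicit, since it is the number you compare against your task's latency budget. A sketch — the function name and shape are this page's own, not a framework API:

```python
def memory_latency_ms(retrievals: int, per_retrieval_ms: float,
                      sync_consolidation_ms: float = 0.0) -> float:
    """Critical-path memory latency for one agent task. Consolidation should
    run async at the session boundary, so it defaults to 0 on the critical
    path; pass a value only if you (wrongly) run it synchronously."""
    return retrievals * per_retrieval_ms + sync_consolidation_ms

# The worked example from the text: 5-15 retrievals at ~100 ms each.
print(memory_latency_ms(5, 100.0))   # 500.0
print(memory_latency_ms(15, 100.0))  # 1500.0
```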
Failure modes you'll hit
- Memory drift between sessions. Episodic memory says one thing; the actual repo / database state says another. The agent confidently reasons against stale knowledge. The single most common production failure mode. Mitigate via repo-state grounding before destructive actions.
- Confident hallucination from consolidation. See compaction section above. Consolidation can make memory actively worse than no memory. Audit consolidated memory monthly; disable consolidation if it's adding noise.
- Embedding model drift. Changing the embedding model after ingestion makes existing collections unreadable. Pin the embedding model; rebuild memory if you change it.
- Context-window exhaustion from memory injection. 5 retrievals × 500 tokens each + system prompt = 3-5K tokens before the agent reasons. With 8K context, half the window burned before thinking. Use 32K+ context models.
- Memory store corruption. Vector stores corrupt quietly. Take periodic snapshots of the memory directory; never rely on a single live store.
- Ambiguous time references. “Yesterday's session” means different things on different days. Memory frameworks should attach timestamps to episodes; agents should query against absolute time, not relative.
- Negation handling failure. Memory systems consistently struggle with negation: “we decided NOT to do X” consolidates as “X.” Audit negative episodes specifically.
- Multi-agent memory crosstalk. Two agents sharing the same memory store can poison each other's state. Per-agent memory isolation is non-optional in multi-agent deployments.
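The ambiguous-time failure mode above has a mechanical fix: resolve relative references into absolute UTC windows before the memory query ever runs. A minimal sketch — only two references handled, and the "last week" window is this sketch's own rough definition; a real agent would cover far more:

```python
from datetime import datetime, timedelta, timezone

def resolve_relative(reference: str, now: datetime) -> tuple[datetime, datetime]:
    """Turn a relative time reference into an absolute [start, end) UTC window.
    Memory queries should then filter on episode timestamps, never on the
    relative phrase itself."""
    if reference == "yesterday":
        start = (now - timedelta(days=1)).replace(hour=0, minute=0,
                                                  second=0, microsecond=0)
        return start, start + timedelta(days=1)
    if reference == "last week":
        # Rough definition for the sketch: the 7-day window ending 7 days ago.
        return now - timedelta(days=14), now - timedelta(days=7)
    raise ValueError(f"unhandled time reference: {reference}")

now = datetime(2026, 5, 12, 9, 30, tzinfo=timezone.utc)
start, end = resolve_relative("yesterday", now)
print(start.isoformat())  # 2026-05-11T00:00:00+00:00
```

The invariant matters more than the implementation: episodes carry absolute timestamps at write time, and every query against them is absolute too.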
Letta / Mem0 / Zep / Graphiti / MCP-memory compared
The five memory frameworks worth considering for local AI in May 2026, with the operator notes that matter:
[Letta](/tools/letta). OS-style explicit memory hierarchy. Working memory (in context), archival memory (paged in/out), explicit memory blocks the agent itself manages. The strongest pick when you need deterministic memory behavior — the agent knows what's in memory and when. Heavier wiring than Mem0; harder to use casually but more powerful for long-horizon tasks. Local-first; OSS.
[Mem0](/tools/mem0). Drop-in vector memory with implicit consolidation. The fastest path from zero to working memory. 20 lines of config; works against any OpenAI-compatible LLM. Default for [/stacks/local-coding-agent](/stacks/local-coding-agent). Trade-off: implicit consolidation makes memory state opaque — when something goes wrong, harder to debug than Letta.
[Zep](/tools/zep). Temporal knowledge graph memory. Hosted product with strong API; OSS core available but hosted cloud is the canonical experience. Stronger than Mem0 on multi-hop reasoning (“what did Bob decide three sessions ago and why”); slower per-query than vector retrieval. Pick for long-horizon agents where multi-hop reasoning matters more than latency.
[Graphiti](/tools/graphiti). OSS counterpart to Zep with deeper Neo4j integration. Local-first; full control over the memory graph. Pick when you want graph memory without the hosted-service dependency. Operationally heavier than Mem0; lighter than Letta.
[MCP-memory](/tools/mcp-server-memory). Anthropic reference MCP server. JSON-on-disk knowledge graph; entry-tier complexity. The right pick for “I want persistent memory in Claude Desktop with one config file.” Wobbles past a few thousand entities; not for production agent deployments.
The decision tree:
- Need it tomorrow, low complexity tolerance: Mem0.
- Need deterministic memory state: Letta.
- Need multi-hop temporal reasoning + cloud OK: Zep.
- Need multi-hop temporal reasoning + local-only: Graphiti.
- Need a memory layer for Claude Desktop: MCP-memory.
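The same tree as a function, with branches checked in the order they're listed above (the flag names are this sketch's own):

```python
def pick_framework(need_it_tomorrow: bool = False,
                   deterministic_state: bool = False,
                   multi_hop: bool = False,
                   cloud_ok: bool = False,
                   claude_desktop: bool = False) -> str:
    # Branches in the order the decision tree lists them.
    if need_it_tomorrow:
        return "Mem0"
    if deterministic_state:
        return "Letta"
    if multi_hop:
        return "Zep" if cloud_ok else "Graphiti"
    if claude_desktop:
        return "MCP-memory"
    return "Mem0"  # low-friction default when nothing else constrains the choice

print(pick_framework(multi_hop=True, cloud_ok=False))  # Graphiti
```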
Local vs hosted memory
The same hosted-vs-self-hosted question that applies to inference applies to memory, with sharper privacy stakes because memory carries cumulative private state.
Local-only paths (Letta, Mem0 OSS, Graphiti, MCP-memory) keep all state on your hardware. The cost: you run the database, you handle backups, you handle scale. The benefit: no third party ever sees agent state.
Hosted paths (Zep cloud, Mem0 cloud) trade third-party visibility for managed scale and cross-machine continuity. Right for some workloads; deal-breaker for others.
The hybrid pattern that works well: local memory for sensitive content, hosted memory for non-sensitive workloads, deliberately separated. Most teams that mix end up over-trusting the hosted side; default to local unless you have a specific reason to use hosted.
When memory helps the agent
- Long-horizon tasks spanning multiple sessions where context across sessions is valuable.
- Repeated patterns where the agent benefits from consolidated learning across episodes.
- Domain-specific knowledge that accumulates across sessions (codebase conventions, team preferences, project history).
- User preferences that should persist across interactions (formatting, language, context cues).
- Multi-step plans that span session boundaries — “continue what we started yesterday.”
When memory hurts the agent
- Single-session tasks with no follow-up. Pure overhead; the retrieval latency is paid for nothing.
- Tasks where stale state is harmful — codebases that change rapidly; environments where “last week's answer” is now wrong.
- Workloads with strict data-residency requirements and a hosted memory provider — the privacy cost outweighs the workflow benefit.
- Agents that haven't been monitored for memory drift. Memory that's not audited becomes confidently wrong; an agent that confidently misremembers is worse than one that has no memory.
- Multi-tenant production where per-user isolation is weak. Cross-tenant memory leaks are catastrophic; don't deploy memory to multi-tenant production until isolation is proven.
Reference stacks for production
The four canonical memory-enabled deployment patterns we recommend in May 2026:
Single-user coding agent with episodic memory. [/stacks/memory-enabled-agent](/stacks/memory-enabled-agent) — OpenHands + Mem0 (LanceDB) + MCP-filesystem/git/postgres + DeepSeek Coder V2 + vLLM. Local-first; private; serves the “remembers what we tried last session” pattern.
Long-horizon planning with explicit memory. OpenHands + Letta + same MCP layer. Letta's explicit memory hierarchy beats Mem0 when the agent needs to reason about what to remember; trade ergonomics for control.
Team-shared memory with hosted graph. Open WebUI + Zep cloud + vLLM. Multi-user agent backend with graph memory; cross-machine continuity; the right pick for a team that's comfortable with hosted memory.
Air-gapped memory with local graph. OpenHands + Graphiti (Neo4j local) + MCP layer. Local graph memory for regulated workloads; operationally heavier than Mem0 but with full local control.
Benchmark and evaluation ideas
The measurements that would let readers actually pick a memory framework; the benchmark dataset is planned to extend toward these:
- Recall@K on episode-similarity queries across Mem0 / Letta / Zep / Graphiti at fixed memory size.
- Multi-hop reasoning accuracy on questions that require traversing 2-3 prior episodes — graph methods should win meaningfully.
- Negation-handling accuracy — episodes where the user explicitly rejected something. The hardest consolidation case; good benchmarks expose hallucination.
- Latency at scale across 1K / 100K / 1M episode counts. Vector vs graph diverges sharply at the high end.
- Memory-induced hallucination rate — measure how often the agent makes confident assertions about memory that don't match ground truth. The single most important metric for production deployments.
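Recall@K from the list above is cheap to compute once retrieval results and relevance labels exist. A sketch — the batch layout (one top-K list and one relevant-set per query) is this sketch's assumption:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[set[str]], k: int) -> float:
    """Recall@K over a batch of episode-similarity queries: fraction of relevant
    episodes appearing in the top-K retrieved list, averaged over queries."""
    scores = []
    for topk, rel in zip(retrieved, relevant):
        hits = len(set(topk[:k]) & rel)
        scores.append(hits / len(rel) if rel else 0.0)
    return sum(scores) / len(scores)

retrieved = [["e1", "e7", "e3"], ["e2", "e9", "e4"]]  # top-3 per query
relevant = [{"e1", "e3"}, {"e5"}]                      # labeled relevant episodes
print(recall_at_k(retrieved, relevant, k=3))           # (2/2 + 0/1) / 2 = 0.5
```

Run the same batch against each framework at a fixed memory size and the numbers become directly comparable.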
Companion reading: the memory-frameworks ecosystem map for the landscape view; [/stacks/memory-enabled-agent](/stacks/memory-enabled-agent) for the canonical deployment recipe; [/systems/mcp](/systems/mcp) for the protocol MCP-memory uses.