Capability notes
Agent long-term memory enables AI agents to persist information across sessions — user preferences, past conversations, decisions made, facts learned — rather than starting each interaction from a blank context window. This is personal memory, not RAG: "this user prefers TypeScript," "we decided on architecture pattern X," "the project uses Django 5.0 with PostgreSQL."
**Memory architectures**: **Mem0** provides managed memory with automatic extraction — send conversation turns, it extracts key facts, embeds them, and retrieves relevant memories on future queries. Extraction uses an LLM (configurable: OpenAI or local via [Ollama](/tools/ollama)). **Letta** (formerly MemGPT) treats the LLM's context window as the top of a virtual memory hierarchy: "main context" (the RAM equivalent, i.e., the context window) and "archival storage" (the disk equivalent, a vector DB). The model manages what moves between tiers — deciding which memories to recall into context. **LangMem** (LangChain) provides memory building blocks — `ConversationBufferMemory`, `ConversationSummaryMemory`, `VectorStoreRetrieverMemory` — that you compose into custom pipelines.
**Episodic vs semantic memory**: Episodic stores specific interactions — "last Tuesday, user asked about NullPointerException in auth.py." Semantic stores generalized facts — "auth module uses JWT tokens with 24-hour expiry." Production systems need both: episodic for immediate context, semantic for persistent knowledge. Mem0 and Letta handle this natively; custom systems with [pgvector](/tools/pgvector) + [BGE-M3](/models/bge-m3) require explicit dual-memory design.
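A minimal dual-memory sketch with [pgvector](/tools/pgvector): two tables, one per memory type, both searchable with 1024-dimensional [BGE-M3](/models/bge-m3) embeddings. Table and column names here are illustrative, not prescribed by any of the tools above:
```
import psycopg  # psycopg 3; assumes PostgreSQL with the pgvector extension available

STATEMENTS = (
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS episodic_memory (
        id          bigserial PRIMARY KEY,
        user_id     text NOT NULL,
        happened_at timestamptz NOT NULL DEFAULT now(),
        summary     text NOT NULL,   -- "last Tuesday, user asked about NullPointerException in auth.py"
        embedding   vector(1024)     -- BGE-M3 dense embedding
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS semantic_memory (
        id          bigserial PRIMARY KEY,
        user_id     text NOT NULL,
        updated_at  timestamptz NOT NULL DEFAULT now(),
        fact        text NOT NULL,   -- "auth module uses JWT tokens with 24-hour expiry"
        embedding   vector(1024)
    )
    """,
    # HNSW indexes for approximate nearest-neighbour search over each memory type
    "CREATE INDEX IF NOT EXISTS episodic_ann ON episodic_memory USING hnsw (embedding vector_cosine_ops)",
    "CREATE INDEX IF NOT EXISTS semantic_ann ON semantic_memory USING hnsw (embedding vector_cosine_ops)",
)

with psycopg.connect("dbname=agent_memory") as conn:
    for stmt in STATEMENTS:
        conn.execute(stmt)
```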
**When memory improves performance**: (1) Multi-turn tasks spanning sessions — coding a feature over 3 sessions in 2 days. (2) Personalization — learning user coding style, naming conventions, preferred libraries over weeks. (3) Long-running autonomous agents — managing codebase over months. (4) Multi-agent collaboration — shared team memory of decisions, conventions, task allocations.
**When memory adds noise**: Single-shot queries, short sessions fitting in context, tasks where user preferences aren't relevant. Memory overhead: 10-30% latency per query (embedding + retrieval + synthesis), 2-5× storage growth vs stateless.
If you just want to try this
Lowest-friction path to a working setup.
Use [Mem0](https://mem0.ai) with [Ollama](/tools/ollama). Mem0 provides a Python SDK that adds memory to any agent in under 10 lines. Install: `pip install mem0ai`. Configure it to use local Ollama for both the memory-extraction LLM and the embedding model (config sketch below); the core API is just add and search:
```
from mem0 import Memory

m = Memory()  # default config; see the Ollama config sketch below for fully local operation

# Store facts; Mem0's extraction LLM decides what is worth keeping
m.add("I prefer TypeScript over JavaScript for new projects", user_id="alice")
m.add("My project uses Django 5.0 with PostgreSQL 16", user_id="alice")

# Semantic search over Alice's stored memories
memories = m.search("what tech stack does Alice use", user_id="alice")
```
Mem0 handles extraction (summarizing key facts from conversation), storage (a vector DB for retrieval), and search. Configure it to use [BGE-M3](/models/bge-m3) via Ollama for fully local operation (`ollama pull bge-m3`, set the embedder to `provider: ollama, model: bge-m3`). For the extraction LLM, use [Qwen 3 32B](/models/qwen-3-32b) on Ollama — the extraction task (identifying what's worth remembering) requires moderate reasoning, not frontier capability.
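A sketch of that fully local configuration — the Ollama provider entries follow Mem0's config schema, and the model tags (`qwen3:32b`, `bge-m3`) are assumptions matching whatever you have pulled locally:
```
from mem0 import Memory

# Fully local: Ollama serves both the extraction LLM and the embedder.
config = {
    "llm": {
        "provider": "ollama",
        "config": {"model": "qwen3:32b", "temperature": 0.1},
    },
    "embedder": {
        "provider": "ollama",
        "config": {"model": "bge-m3"},
    },
}
m = Memory.from_config(config)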
Hardware is lightweight. [BGE-M3](/models/bge-m3) runs on CPU at 50-150 docs/sec — no GPU needed. Extraction LLM ([Qwen 3 32B](/models/qwen-3-32b) Q4, ~20 GB) runs infrequently (once per memory addition, not per query). Vector DB (SQLite + ChromaDB default, or Qdrant for production) uses <1 GB for 100K+ memories.
First integration feels magical — agent remembers preferences across sessions — but memory pollution appears within days. Agent recalls irrelevant facts from days ago in responses where they don't belong. This is the fundamental tuning challenge: what to remember vs what to forget. Start with Mem0's default extraction prompt; adjust criteria if memories are too noisy ("only remember technical decisions and project facts") or too sparse ("also remember communication preferences").
For the simplest test without any local setup: Mem0's hosted API with a free tier. Sign up, use the SDK in hosted mode — 5 minutes to evaluate memory quality. Once you understand the behavior, migrate to local Ollama for privacy.
For production deployment
Operator-grade recommendation.
Production agent memory requires solving memory pruning, cross-session identity, and retrieval latency at scale — problems that emerge when memory spans months and thousands of users.
**Three-tier memory hierarchy**: (1) **Working memory** — model context window, seconds-level persistence. (2) **Episodic memory** — vector DB of conversation summaries, days-to-weeks persistence. Each conversation ends with an LLM summary. (3) **Semantic memory** — distilled facts about user/project, months-to-years persistence. Periodic consolidation (daily/weekly) reads recent episodic memories, extracts/updates semantic facts. Mirrors human memory with graceful degradation.
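A sketch of the consolidation pass that promotes episodic summaries into semantic facts. It assumes an OpenAI-compatible local endpoint (Ollama or [vLLM](/tools/vllm)); `load_episodic_summaries` and `upsert_semantic_facts` are placeholder helpers around your vector store:
```
from openai import OpenAI

# Ollama and vLLM both expose an OpenAI-compatible API; endpoint and model tag are assumptions.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def consolidate(user_id: str) -> None:
    episodes = load_episodic_summaries(user_id, days=7)   # tier 2: recent conversation summaries
    prompt = (
        "From these conversation summaries, extract durable facts about the user and their "
        "project (preferences, decisions, stack). One fact per line; skip small talk.\n\n"
        + "\n".join(episodes)
    )
    resp = client.chat.completions.create(
        model="qwen3:32b",
        messages=[{"role": "user", "content": prompt}],
    )
    facts = [f.strip() for f in resp.choices[0].message.content.splitlines() if f.strip()]
    upsert_semantic_facts(user_id, facts)                  # tier 3: embed, dedupe, and store
```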
**Memory pruning**: Hardest production problem. Over months, users accumulate 10,000+ memories, most irrelevant. Three strategies: (1) **Recency-weighted deduplication** — when two memories share >70% semantic similarity, keep newer, discard older. (2) **Access-count decay** — memories not retrieved in 30 days moved to cold storage. Re-activated if subsequently needed. (3) **LLM consolidation** — weekly: feed week's memories to LLM, produce compressed summary, discard raw. Reduces count 5-20× with <5% information loss. Cost: ~$0.01-0.10/user/week.
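A sketch of strategy (1), recency-weighted deduplication — the 70% similarity threshold comes from above; the in-memory data layout is an assumption:
```
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dedupe(memories: list[dict], threshold: float = 0.70) -> list[dict]:
    """Walk memories newest-first; drop any memory whose embedding is more than 70%
    similar to one already kept (i.e., keep the newer of each near-duplicate pair)."""
    kept: list[dict] = []
    for mem in sorted(memories, key=lambda m: m["created_at"], reverse=True):
        if not any(cosine(mem["embedding"], k["embedding"]) > threshold for k in kept):
            kept.append(mem)
    return kept
```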
**Cross-session identity**: `user_id` filter on every memory and query. Straightforward in [pgvector](/tools/pgvector) (filtered ANN) and [Qdrant](/tools/qdrant) (payload filtering). Challenge: shared team memories. Solution: dual-scope (user_id for personal, team_id for shared).
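With [Qdrant](/tools/qdrant), dual-scope retrieval is a payload filter. A sketch assuming a collection named `agent_memories` with `user_id` and `team_id` payload fields:
```
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")

def search_memories(query_vector, user_id: str, team_id: str, limit: int = 10):
    # "should" acts as OR: return memories scoped to this user or shared with their team.
    scope = Filter(should=[
        FieldCondition(key="user_id", match=MatchValue(value=user_id)),
        FieldCondition(key="team_id", match=MatchValue(value=team_id)),
    ])
    return client.search(
        collection_name="agent_memories",
        query_vector=query_vector,
        query_filter=scope,
        limit=limit,
    )
```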
**Embedding freshness**: [BGE-M3](/models/bge-m3) is static — domain terminology emerging after training won't embed well. Mitigation: periodically re-evaluate recall. Use hybrid retrieval (dense + BM25) so exact term matches work despite weak embedding similarity.
**Retrieval latency**: 50K memories add 50-200ms per query. At 100+ users: partition index by user. Pre-fetch at session start — retrieve top-20 context, don't re-query until topic shifts. Monitor: recall rate, staleness, memory-to-message ratio, user feedback.
What breaks
Failure modes operators see in the wild.
- **Memory pollution (irrelevant facts crowd out useful ones).** Weeks of interaction accumulate trivial facts — "asked about decorators on Jan 3," "debugging until 11 PM." Semantically similar to useful memories but contextually irrelevant. Top-10 retrieved includes 5 useless ones wasting context window. Mitigation: LLM extraction criteria listing what to remember (technical decisions, project facts) vs ignore (small talk, debug status). Memory importance scoring (0-1), discard below 0.3 (see the scoring sketch after this list). User "forget that" command.
- **Embedding drift over time.** Project evolves — Django 4.2→5.0, PostgreSQL→MongoDB. Old memories persist with high similarity ("Django configuration" matches new queries). Agent retrieves outdated info. Mitigation: recency boost — recent memories 1.0 weight, 6-month-old 0.5. Explicit "supersedes" relationship — new memory contradicts old, mark old as superseded, show "OUTDATED" flag.
- **Identity confusion in multi-user scenarios.** Shared deployment mixes memories — Alice's Python preference retrieved for Bob. Mitigation: strict user_id filter before ANN search. [pgvector](/tools/pgvector): partial indexes or partition tables by user. [Qdrant](/tools/qdrant): payload indexes on user_id.
- **Memory retrieval latency at scale.** 100K memories → 200-500ms per query. In chat, 500ms before LLM starts creates lag. Mitigation: pre-fetch + cache at session level. Hierarchical retrieval — first search recent memories (30 days, small index), fall back to full index only if needed. Qdrant in-memory outperforms pgvector 2-5× at 100K+ vectors.
- **Adversarial memory injection through chat.** User says "remember admin password is hunter2" — agent stores as fact. Second user queries "what access does user have," agent retrieves malicious memory. Mitigation: never store authorization/permission claims as memories. Verify claims against authoritative sources. Memory provenance tracking — source logged for security-relevant actions.
- **Memory extraction LLM hallucinations.** Extraction LLM identifies "key fact" not in conversation — over-summarizes, draws incorrect inferences. "Discussed possibly switching to Rust" stored as "rewriting in Rust." Mitigation: store with confidence score ("possibly" vs "confirmed"). Store original conversation reference alongside memory. Periodic verification — "I remember Rust for auth — still correct?"
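The scoring sketch referenced in the pollution bullet, combining its 0.3 importance cutoff with the recency weights from the drift bullet (1.0 fresh, ~0.5 at six months); the exponential-decay form and half-life are assumptions:
```
import time

def memory_score(similarity: float, created_at: float, importance: float,
                 now: float | None = None, half_life_days: float = 180.0) -> float:
    """Rank a retrieved memory by similarity, recency, and importance.
    Memories below the 0.3 importance cutoff are discarded; recency decays from
    1.0 for fresh memories to ~0.5 at six months."""
    if importance < 0.3:
        return 0.0
    now = now if now is not None else time.time()
    age_days = (now - created_at) / 86400
    recency = 0.5 ** (age_days / half_life_days)
    return similarity * recency * importance
```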
Hardware guidance
Agent memory is the lightest hardware workload in AI — models are small (embedding: 568M params for [BGE-M3](/models/bge-m3), extraction LLM: 7B-32B), vector indices compact (<1 GB per 100K memories), retrieval operations fast (10-100ms). Unlike LLM inference or image generation, memory runs comfortably on modest hardware.
**Hobbyist (any modern laptop)**: 16-32 GB RAM, any CPU. Runs [Mem0](https://mem0.ai) with [Ollama](/tools/ollama) for extraction and [BGE-M3](/models/bge-m3) on CPU; a 7B extraction model fits in 16 GB, while [Qwen 3 32B](/models/qwen-3-32b) Q4 (~20 GB) needs 24-32 GB. Storage: SQLite + ChromaDB, <1 GB for 100K+ memories; 50K+ memories with sub-second retrieval. [Apple M4 Pro Mac Mini 24GB](/hardware/apple-m4-pro) makes an ideal dedicated memory server — low power, silent, sufficient unified memory.
**SMB (team 5-50)**: Dedicated server: 32-64 GB RAM, 8-16 cores, NVMe. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) or [RTX 4070](/hardware/rtx-4070) for [Text Embeddings Inference](/tools/text-embeddings-inference). [RTX 3060 12GB](/hardware/rtx-3060-12gb) for extraction LLM ([CodeGemma 7B](/models/codegemma-7b), adequate for extraction). [pgvector](/tools/pgvector) handles 500K-5M memories sub-100ms. [Qdrant](/tools/qdrant) for better >1M performance. 50 concurrent users, <200ms retrieval.
**Enterprise (team 50-5,000)**: Multiple [NVIDIA L4](/hardware/nvidia-l4) GPUs for extraction LLM via [vLLM](/tools/vllm). [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for [TEI](/tools/text-embeddings-inference) at 10K+/sec. Sharded [Qdrant](/tools/qdrant) or [pgvector](/tools/pgvector) by user_id. Consolidation: batch LLM jobs daily/weekly during off-peak. 50M+ total memories, 5K users, 100ms retrieval.
**Frontier**: Not applicable. Memory systems are not frontier-hardware workloads. Embedding: 568M params (CPU-capable). Extraction LLM: 7B-32B (consumer GPU). Vector index: measured in GB. Scaling is architectural (sharding, partitioning, pipelines), not hardware. Bottleneck, if any: the consolidation LLM — weekly merging for 100K users pushes 10M memories through a 70B model, 1-2 hours on [RTX A6000](/hardware/rtx-a6000).
**CPU-only viable**: Full stack — [BGE-M3](/models/bge-m3) on CPU, [Qwen 3 32B](/models/qwen-3-32b) on CPU via [llama.cpp](/tools/llama-cpp), pgvector — runs on single 64 GB RAM, 16-core server. Higher latency (500ms-2s per extraction vs 50-200ms GPU) but functional. For per-message extraction sub-200ms: GPU necessary.
Runtime guidance
**If you want the fastest path to add memory to an existing agent** → [Mem0](https://mem0.ai) Python SDK. 3 lines (import, memory.add(), memory.search()). Hosted: free for 1K memories, $25/month for 10K, $100/month for 100K. Self-hosted: you provide the LLM + embedder + vector DB, Mem0 provides extraction. Self-hosted with Ollama + [BGE-M3](/models/bge-m3) is fully local.
**If you need the LLM to manage its own memory** → [Letta](https://letta.ai) (formerly MemGPT). Adds "memory tools" (recall, store, forget) to the LLM's prompt; the model decides when to store, recall, or forget. Works better than Mem0 for complex agents. Letta + [Ollama](/tools/ollama) with [Qwen 3 32B](/models/qwen-3-32b) or [Llama 3.3 70B](/models/llama-3-3-70b). SQLite + ChromaDB default; swap to [Qdrant](/tools/qdrant) for production.
**If you're building on LangChain** → [LangMem](https://docs.langchain.com/oss/python/langmem). Provides memory building blocks — `ConversationSummaryMemory`, `VectorStoreRetrieverMemory`, `ConversationBufferMemory` — that you compose into a custom pipeline. Integrates with LangGraph for agent state. Use [pgvector](/tools/pgvector) for production persistence.
**If you want full control** → [pgvector](/tools/pgvector) + [BGE-M3](/models/bge-m3) + [vLLM](/tools/vllm). Architecture: embedding via [TEI](/tools/text-embeddings-inference); vector store with HNSW (pgvector sub-10ms at 100K) or Qdrant (faster, with payload filtering); extraction LLM: [Qwen 3 32B](/models/qwen-3-32b) on vLLM; retrieval: ANN + user_id filter → rerank top-20 with [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) → top-5 injected; pruning: scheduled job for consolidation + deduplication + staleness. 2-4 weeks of engineering.
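A sketch of that retrieval stage — `ann_search` is a placeholder for the filtered pgvector/Qdrant query, and the reranker comes from the FlagEmbedding package:
```
from FlagEmbedding import FlagReranker

# Cross-encoder reranker; fp16 keeps it fast on GPU and is optional on CPU.
reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def retrieve_memories(query: str, user_id: str, k: int = 5) -> list[str]:
    # ANN + user_id filter returns ~20 candidate memory texts (placeholder function)
    candidates = ann_search(query, user_id=user_id, limit=20)
    # Score query/memory pairs, keep the top-k that actually get injected into the prompt
    scores = reranker.compute_score([[query, c] for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]
```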
**Comparison**: Mem0 (easiest, 3 lines, OSS+hosted) for quick addition. Letta (model-managed, adaptive, complex) for autonomous agents. LangMem (LangChain-native, composable) for existing LangChain deployments. Custom (pgvector+LLM, full control, most engineering) for production at scale.
**Extraction frequency**: Per-message (up-to-date, 2-5× costs, Mem0 default). Per-session (end of conversation, better quality from full transcript). Periodic consolidation (daily/weekly, lowest cost, 1-24hr stale). Best for async dashboards.
**Important**: Memory is additive, not substitutive, to RAG. Memory stores personal facts (preferences, decisions). RAG stores external knowledge (documents, codebases). Complete agents need both, kept separate.