Benchmarking AI Agent Memory: Is a Filesystem All You Need?

Summary: Letta agents achieve 74.0% accuracy on LoCoMo by simply storing conversation histories in files, rather than using specialized memory or retrieval tools. This suggests that: 1) current memory benchmarks may not be very meaningful, and 2) memory is more about how agents manage context than the exact retrieval mechanism used.
Memory for AI Agents
Since the dawn of GPT-4, a fundamental constraint on LLMs has been their limited context length. Without long-term memory, LLMs and agents forget information, cannot learn and improve over time, and lose track of their objectives during long-running, complex tasks (a phenomenon often referred to as “derailment”).
MemGPT introduced memory management for agents by creating a memory hierarchy inspired by a traditional operating system (OS). Agents actively manage what remains in their immediate context (core memory) versus what gets stored in external layers (conversational memory, archival memory, and external files) that can be retrieved as needed. This approach allows agents to maintain unlimited memory capacity within fixed context windows. Many agentic systems today, including Letta, implement MemGPT’s design to enable long-term memory in LLM agents.
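As a rough conceptual sketch (illustrative only, not MemGPT's or Letta's actual internals), the hierarchy separates what stays pinned in the context window from external tiers the agent can search and write to:

```python
from dataclasses import dataclass, field

# Conceptual sketch of a MemGPT-style memory hierarchy.
# Names and structure are illustrative, not the real MemGPT/Letta data structures.
@dataclass
class AgentMemory:
    core_memory: dict = field(default_factory=dict)             # always in context (persona, key user facts)
    conversational_memory: list = field(default_factory=list)   # evicted messages, searchable out of context
    archival_memory: list = field(default_factory=list)         # long-term store the agent writes to explicitly
    files: list = field(default_factory=list)                    # external documents attached to the agent

    def evict(self, message: str) -> None:
        """When the context window fills, older messages move to conversational
        memory (out of context but retrievable) instead of being lost."""
        self.conversational_memory.append(message)

    def archive(self, fact: str) -> None:
        """The agent explicitly saves important facts for later retrieval."""
        self.archival_memory.append(fact)
```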
Additionally, various memory-specific tools have emerged to offer "memory" as a pluggable service, providing agents with tools to store and retrieve information, often using specialized knowledge graphs or vector database solutions.
Attempts at Benchmarking Memory Tools (e.g., Mem0, LangMem, Zep)
Evaluating the effectiveness of these memory tools in isolation is extremely challenging. The quality of an agent's memory often depends more on the underlying agentic system's ability to manage context and call tools than on the memory tools themselves. For example, even if a search tool is theoretically more performant, it won't work well for memory if the agent cannot use it effectively (due to poor prompting or lack of examples in training data).
As a result, evaluation of memory tools has primarily focused on retrieval benchmarks like LoCoMo, rather than agentic memory itself. LoCoMo is a question-answering benchmark focusing on retrieval from long conversations. Each sample contains two fictional speakers and a list of AI-generated, timestamped conversations. The task involves answering questions about the speakers or facts presented in their conversations.
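Schematically, a LoCoMo-style sample pairs a multi-session transcript with question-answer pairs about it (field names below are simplified and illustrative, not the dataset's exact schema):

```python
# Schematic LoCoMo-style sample; the structure is illustrative only.
sample = {
    "speakers": ["Calvin", "Dana"],
    "sessions": [
        {
            "timestamp": "2023-05-01 7:30 pm",
            "dialogue": [
                {"speaker": "Calvin", "text": "Work threw a lot of setbacks at me this month."},
                {"speaker": "Dana", "text": "How do you keep going through that?"},
            ],
        },
    ],
    "qa": [
        {
            "question": "How does Calvin stay motivated when faced with setbacks?",
            "answer": "By setting small goals.",
        },
    ],
}
```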
One memory tool creator, Mem0, published controversial results claiming to have run MemGPT on LoCoMo. The results were puzzling, since our research team (the same team behind MemGPT) was unable to determine a way to backfill LoCoMo data into MemGPT/Letta without significant refactoring of the codebase. Mem0 did not respond to requests for clarification on how the benchmarking numbers were computed, or provide any modified MemGPT implementation that supports meaningful backfill of LoCoMo data.
Benchmarking Letta Filesystem with LoCoMo
Although Letta does not have a native way to ingest conversational histories like those in LoCoMo, we recently added support for connecting files to Letta agents (including MemGPT agents), a feature called Letta Filesystem. We were curious to see how Letta would perform by simply placing the LoCoMo conversational history into a file, without any specialized memory tools.
When files are attached to a Letta agent, the agent gains access to a set of file operation tools:
- grep
- search_files
- open
- close
The conversational data is placed into a file, which is uploaded and attached to the agent. Files in Letta are automatically parsed and embedded to enable semantic (vector) search over their contents. The agent is given tools for semantic search (search_files), text matching (grep), and answering questions (answer_question).
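A minimal sketch of this setup using the Letta Python SDK is below. The method names and the agent configuration shown are assumptions that may differ across SDK versions; treat it as a rough outline rather than a verbatim recipe.

```python
from letta_client import Letta

# Connect to a Letta server (cloud or self-hosted).
client = Letta(token="LETTA_API_KEY")

# Create an agent; the model/embedding configuration here is illustrative.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    memory_blocks=[
        {"label": "persona", "value": "You answer questions about an attached conversation transcript."},
    ],
)

# Create a source, upload the LoCoMo transcript, and attach it to the agent.
# Letta parses and embeds the file so search_files can run semantic search over it.
source = client.sources.create(name="locomo-sample", embedding="openai/text-embedding-3-small")
with open("locomo_conversation.txt", "rb") as f:
    client.sources.files.upload(source_id=source.id, file=f)
client.agents.sources.attach(agent_id=agent.id, source_id=source.id)
```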
We used GPT-4o mini for the agent to match the model reportedly used in Mem0's MemGPT experiment. Since GPT-4o mini is a weaker model, we made the agent only partially autonomous by defining tool rules that constrain its tool-calling pattern: the agent must start by calling search_files and keep searching through files until it decides to call answer_question and terminate. What it searches for, and how many times it calls tools, is left up to the agent.
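In Letta, this kind of constraint is expressed through tool rules. Below is a minimal sketch, assuming the InitToolRule and TerminalToolRule helpers from the Python SDK (exact class names and import paths may vary across versions):

```python
from letta_client import Letta, InitToolRule, TerminalToolRule

client = Letta(token="LETTA_API_KEY")

# Constrain the agent's tool-calling pattern:
#   - the first tool call must be search_files
#   - calling answer_question ends the run
# Everything in between (what to search for, how many times) is up to the agent.
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    tool_rules=[
        InitToolRule(tool_name="search_files"),
        TerminalToolRule(tool_name="answer_question"),
    ],
)
```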
This simple agent achieves 74.0% on LoCoMo with GPT-4o mini and minimal prompt tuning, significantly above Mem0's reported 68.5% score for their top-performing graph variant.
Why Does a Filesystem Beat Specialized Memory Tools?
Agents today are highly effective at using tools, especially those likely to have been in their training data (such as filesystem operations). As a result, specialized memory tools that may have originally been designed for single-hop retrieval are less effective than simply allowing the agent to autonomously search through data with iterative querying.
Agents can generate their own queries rather than simply searching the original questions (e.g., transforming "How does Calvin stay motivated when faced with setbacks?" into "Calvin motivation setbacks"), and they can continue searching until the right data is found.
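As a toy illustration of why this matters (stand-in data and a naive grep, not the actual agent code): a literal single-hop search with the full question finds nothing, while the agent's own shorter queries do.

```python
# Toy example: a grep-style substring search over a tiny transcript.
TRANSCRIPT = [
    "Calvin: work has thrown a lot of setbacks at me this year.",
    "Calvin: I stay on track by setting small goals and reviewing them every week.",
]

def grep(query: str) -> list[str]:
    return [line for line in TRANSCRIPT if query.lower() in line.lower()]

# Single-hop retrieval with the original question misses everything:
print(grep("How does Calvin stay motivated when faced with setbacks?"))  # -> []

# An agent instead issues its own short queries and keeps searching until it finds evidence:
for query in ["setbacks", "goals", "stay on track"]:
    if hits := grep(query):
        print(query, "->", hits)
```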
Memory for Agents: Agent Capabilities Matter More Than the Tools
Whether an agent "remembers" something depends on whether it successfully retrieves the right information when needed. Therefore, it's much more important to consider whether an agent will be able to effectively use a retrieval tool (knowing when and how to call it) rather than focusing on the exact retrieval mechanisms (e.g. knowledge graphs vs vector databases).
Agents today are extremely effective at using filesystem tools, largely due to post-training optimization for agentic coding tasks. In general, simpler tools are more likely to be in the training data of an agent and therefore more likely to be used effectively. While more complex solutions like knowledge graphs may help in specific domains, they may also come at the cost of being more difficult for the LLM (agent) to understand.
How to Properly Evaluate Agent Memory
An agent's memory depends on the agent architecture, its tools, and the underlying model. Comparing agent frameworks and agent memory tools is like comparing apples to oranges, as you can always mix and match frameworks and tools (and, of course, models).
The Letta Memory Benchmark (Letta Leaderboard) provides an apples-to-apples comparison evaluating different models' capabilities in terms of memory management, keeping the framework (currently just Letta) and tools constant. The benchmark creates memory interactions on-the-fly to evaluate memory in a dynamic context, rather than just retrieval (as with LoCoMo).
Another approach to evaluating memory is to assess the agent's holistic performance on specific tasks that require memory. One example is Terminal-Bench, which evaluates how well agents can solve complex, long-running tasks. Because tasks are long-running and require processing far more state than what fits into context, agents can leverage their memory to keep track of their task state and progress. Letta's OSS terminal-use agent is currently #4 overall (#1 for OSS) on the Terminal-Bench coding benchmark.
Conclusion
With a well-designed agent, even simple filesystem tools are sufficient to perform well on retrieval benchmarks such as LoCoMo. More complex memory tools can be plugged into agent frameworks like Letta via MCP or custom tools.
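For example, a custom retrieval tool can be registered as a plain Python function and attached to an agent. The sketch below is an assumption-laden outline: memory_lookup is a hypothetical tool, and SDK method names such as upsert_from_function may differ across versions.

```python
from letta_client import Letta

client = Letta(token="LETTA_API_KEY")

def memory_lookup(query: str) -> str:
    """Search an external memory backend (hypothetical example tool).

    Args:
        query: natural-language query to run against the external store.
    """
    # Swap in whatever backend you like here: a vector DB, a knowledge graph, etc.
    return "no results"

# Register the function as a tool and give it to an agent alongside the built-ins.
tool = client.tools.upsert_from_function(func=memory_lookup)
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    embedding="openai/text-embedding-3-small",
    tool_ids=[tool.id],
)
```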
For more resources, see:
- Letta Memory Benchmark for evaluating model capabilities for agentic memory
- Code for running the LoCoMo benchmark
You can get started building Letta agents with:
- Letta Cloud for cloud-hosted agents
- Letta Desktop for fully local agents