
Letta Evals: Evaluating Agents that Learn

October 23, 2025

AI agents are incredibly powerful because of their flexibility: they can solve problems on their own without being constrained to rigid, pre-defined workflows. However, that flexibility often comes at the cost of reliability. That's why we're releasing Letta Evals: an open-source evaluation framework for systematically testing agents in Letta.

What are Agent Evals?  

Building agents is an iterative process: testing different prompts, new models, architectures, and tools. What happens when you update your prompt? Switch to a new model? Add a new tool? Will your agent still behave as intended, or will you unknowingly break critical functionality? Manual testing doesn't scale and can't provide the reliability you need; without systematic testing, regressions slip through.

Evals are a way to systematically test LLMs and agents. Evals are typically run by defining a stateless function that wraps the functionality you want to test (e.g. invoking an LLM with tools), running the function on a dataset, and grading the outputs (such as with an LLM-as-judge).
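
In pseudocode, this classic stateless pattern looks something like the sketch below. The `call_model` stub and the exact-match grader are illustrative stand-ins for whatever model invocation and grading logic you would plug in, not a real API.

```python
import json

def call_model(prompt: str) -> str:
    # Stand-in for the stateless function under test,
    # e.g. a single LLM call with tools attached.
    return "stubbed model output"

def grade(output: str, expected: str) -> float:
    # Simplest possible grader: exact string match. In practice this
    # could be an LLM-as-judge scoring against a rubric.
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(dataset_path: str) -> float:
    scores = []
    with open(dataset_path) as f:
        for line in f:  # one JSON test case per line (JSONL)
            case = json.loads(line)
            output = call_model(case["input"])
            scores.append(grade(output, case["expected"]))
    return sum(scores) / len(scores)  # aggregate accuracy
```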

But over time, agents have become less like stateless functions and more like humans: defined not only by their initial state, but also by their lived experiences. The behavior of long-lived, stateful agents changes over time as they accumulate state and context. As we showed in Recovery-Bench, a “fresh” agent is very different from an agent that already has lived experience. Agent design changes (e.g. adding a new tool) must therefore be evaluated for both new agents and existing agents.

Letta Evals Core Concepts

Letta Evals is an evaluation framework for systematically evaluating the performance of stateful agents in Letta. At its core, it does one thing incredibly well: running agents exactly as they would run in a production environment, and grading the results. Rather than mocking agent behavior or testing in isolation, Letta Evals uses Agent File (.af) to create and evaluate many replicas of an agent in parallel.

There are four core concepts in Letta Evals: datasets, targets, graders, and gates.

  • Datasets: A JSONL file where each line is a test case, consisting of an input plus optional expected outputs and metadata (see the sketch below).
  • Targets: An Agent File (.af) defining the agent (prompts, tools, messages, memory, and other state).
  • Graders: How the responses from the agents are scored. Letta Evals ships with tool graders (such as string exact match) as well as rubric graders that use LLMs-as-judges with custom scoring prompts. Users can also define their own grading functions, or even use a Letta agent as the grader.
  • Gates: Pass/fail thresholds (e.g. "95% accuracy" or "zero failures") for your evaluation, which prevent regressions before they ship.

Letta Evals allows for systematic evaluation of agents from your terminal
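
To make these four concepts concrete, here is a sketch of a two-case dataset. The field names (`input`, `expected`, `metadata`) are illustrative assumptions, so refer to the Letta Evals documentation for the exact schema.

```jsonl
{"input": "What is the capital of France?", "expected": "Paris", "metadata": {"topic": "qa"}}
{"input": "My name is Ada. Please remember it.", "metadata": {"topic": "memory"}}
```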

CI/CD Integration

One of the most powerful uses of Letta Evals is as a quality gate in continuous integration pipelines (see our example GitHub workflow). By defining pass/fail criteria in your suite configuration, you can automatically prevent agent regressions from being deployed to production. When a gate fails, the Letta Evals run exits with a non-zero status code, so you can block pull requests that break agent behavior just as you would block PRs that break unit tests. By making evals a required check in the deployment pipeline, teams can ensure that every change to an agent is validated against the test suite before it reaches users.

Letta Evals can be integrated into GitHub workflow automations
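
A minimal workflow sketch is below. The `letta-evals` package and command names and the file paths are assumptions for illustration; use the example workflow linked above as the authoritative reference.

```yaml
name: agent-evals
on: [pull_request]

jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Package and command names are illustrative assumptions.
      - run: pip install letta-evals
      # A non-zero exit status here fails the check and blocks the PR,
      # just like a failing unit-test job.
      - run: letta-evals run suite.yaml
```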

A/B Testing Experiments

Evals are essential for making data-driven decisions when experimenting with agent improvements. Whether you're comparing different models, testing prompt variations, or evaluating tool changes, Letta Evals provides objective metrics for guiding agent development. The key advantage of systematic A/B testing with evals is reproducibility. Unlike manual testing where results can vary based on who's testing or when they test, evals give you consistent, comparable metrics across experiments. 
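
In practice, an A/B experiment reduces to running the same suite against two agent variants and comparing aggregate scores. In the sketch below, `run_suite` is a hypothetical helper (stubbed so the example runs), not the actual Letta Evals API.

```python
def run_suite(agent_file: str, dataset: str) -> float:
    # Hypothetical helper: evaluate fresh replicas of the agent defined
    # by `agent_file` on every case in `dataset` and return the mean
    # grader score. Stubbed here so the sketch runs end to end.
    return 0.0

# Identical dataset and graders for both variants, so any score delta
# is attributable to the change itself.
baseline = run_suite("agent_baseline.af", "cases.jsonl")
candidate = run_suite("agent_new_model.af", "cases.jsonl")
print(f"baseline={baseline:.3f}  candidate={candidate:.3f}  "
      f"delta={candidate - baseline:+.3f}")
```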

Customer Highlight: Bilt Rewards

We're already seeing customers use Letta Evals to ensure production-grade agent deployments. Bilt runs over a million personalized stateful agents for neighborhood commerce recommendations, and uses the Letta Evals framework to maintain reliability as they rapidly iterate on their agent architectures. With millions of agents in production serving personalized recommendations to users, Bilt needs confidence that changes improve performance without breaking existing functionality. By integrating evals into their development workflow, they can validate agent behavior across their test suite before deploying updates, whether they're testing new models, refining prompts, or adjusting their multi-agent architecture.

Letta Leaderboard

Internally at Letta, we are also using the Letta Evals package to evaluate models on the Letta Leaderboard, a set of standardized evaluations for comparing different LLMs on core memory and filesystem capabilities in Letta. These benchmarks help you choose the right model for your use case and track improvements as models evolve. All leaderboard evaluations are available as examples in the Letta Evals package.

Conclusion

As agents become more sophisticated and take on increasingly important tasks where errors can be critical, systematic evaluation of long-running agent behavior becomes essential. Letta Evals provides a rigorous, reproducible way to test agents, catch regressions in agent behavior, and make informed decisions about models, tools, and architecture.

To try Letta Evals, check out the open-source repository and the Letta documentation.
