Letta Evals: Evaluating Agents that Learn

AI agents are incredibly powerful due to their flexibility and ability to solve problems on their own without being constrained to rigid pre-defined workflows. However, flexibility can often come at the cost of reliability. That's why we're releasing Letta Evals: an open-source evaluation framework for systematically testing agents in Letta.
What are Agent Evals?
Building agents is an iterative process – testing different prompts, new models, architectures, and tools. What happens when you update your prompt? Switch to a new model? Add a new tool? Will your agent still behave as intended, or will you unknowingly break critical functionality? Manual testing doesn't scale, and it can't provide the reliability you need. Without systematic testing, regressions slip through.
Evals are a way to systematically test LLMs and agents. Evals are typically run by defining a stateless function that wraps the functionality you want to test (e.g. invoking an LLM with tools), running that function over a dataset, and grading the outputs (for example, with an LLM-as-judge).
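In code, that classic pattern looks roughly like the sketch below. This is a generic illustration rather than the Letta Evals API: the task function, grader, and dataset path are all placeholders.

```python
import json

def run_task(input_text: str) -> str:
    # Stand-in for the stateless function under test, e.g. a single
    # LLM call with a fixed prompt and tools. Replace with a real call.
    return "stub response for: " + input_text

def grade(output: str, expected: str) -> bool:
    # Simplest possible grader: exact string match. In practice this
    # could be an LLM-as-judge scoring the output against a rubric.
    return output.strip() == expected.strip()

# Load the dataset (one JSON test case per line) and grade each output.
with open("dataset.jsonl") as f:
    cases = [json.loads(line) for line in f]

results = [grade(run_task(case["input"]), case["expected"]) for case in cases]
print(f"accuracy: {sum(results) / len(results):.2%}")
```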
But over time, agents have become less like stateless functions and more like humans: defined not only by their initialization state, but also by their lived experiences. The behavior of long-lived, stateful agents changes over time as they accumulate more state and context. As we showed in RecoveryBench, a “fresh” agent is very different from an agent that already has lived experience. Agent design changes (e.g. adding a new tool) must be evaluated for both new and existing agents.
Letta Evals Core Concepts
Letta Evals is an evaluation framework for systematically evaluating the performance of stateful agents in Letta. At its core, it does one thing incredibly well: running agents exactly as they would run in a production environment, and grading the results. Rather than mocking agent behavior or testing in isolation, Letta Evals uses Agent File (.af) to create and evaluate many replicas of an agent in parallel.
There are four core concepts in Letta Evals: datasets, targets, graders, and gates.
- Datasets: A JSONL file where each line represents a test case, consisting of an input along with optional expected outputs and metadata (a sample is shown after this list).
- Targets: An Agent File (.af) defining the agent (prompts, tools, messages, memory, and other state).
- Graders: How the responses from the agents are scored. By default, Letta Evals comes with tool graders such as string exact match, as well as rubric graders that use LLMs-as-judges with custom scoring prompts. Users can also define their own grader, or even use a Letta agent as the grader.
- Gates: Pass/fail thresholds (e.g. "95% accuracy" or "zero failures") for your evaluation, to prevent regressions before they ship.
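As a rough illustration, a dataset is just newline-delimited JSON; the exact field names below are illustrative placeholders rather than the framework's required schema:

```jsonl
{"input": "What restaurant did I say I wanted to try this weekend?", "expected": "Osteria on 5th", "metadata": {"category": "memory_recall"}}
{"input": "Cancel my Friday reservation.", "expected": "cancel_reservation", "metadata": {"category": "tool_use"}}
```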

CI / CD Integration
One of the most powerful uses of Letta Evals is as a quality gate in continuous integration pipelines (see our example GitHub workflow). By defining pass/fail criteria in your suite configuration, you can automatically prevent agent regressions from being deployed to production. When you run an evaluation in Letta Evals, it exits with a non-zero status if the gate fails, so you can block pull requests that break agent behavior just like you would block PRs that break unit tests. By making evals a required check in the deployment pipeline, teams can ensure that every change to an agent is validated against the test suite before it reaches users.
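As a minimal sketch, a CI step can simply treat that exit status as the pass/fail signal. The `letta-evals run suite.yaml` invocation below is a hypothetical stand-in for however you actually run your suite:

```python
import subprocess
import sys

# Hypothetical command: substitute the actual invocation you use to run
# your Letta Evals suite against its configuration file.
result = subprocess.run(["letta-evals", "run", "suite.yaml"])

# A non-zero exit status means a gate failed; propagate it so the CI job
# (and therefore the pull request check) fails too.
sys.exit(result.returncode)
```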

A / B Testing Experiments
Evals are essential for making data-driven decisions when experimenting with agent improvements. Whether you're comparing different models, testing prompt variations, or evaluating tool changes, Letta Evals provides objective metrics for guiding agent development. The key advantage of systematic A/B testing with evals is reproducibility. Unlike manual testing, where results can vary based on who's testing or when they test, evals give you consistent, comparable metrics across experiments.
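As a sketch of what that comparison can look like, suppose each experiment writes its per-case scores to a results file (the file names and the "score" field here are hypothetical, not a format Letta Evals guarantees):

```python
import json

def mean_score(path: str) -> float:
    # Hypothetical results format: one JSON object per line, each with a
    # numeric "score" assigned by the grader.
    with open(path) as f:
        scores = [json.loads(line)["score"] for line in f]
    return sum(scores) / len(scores)

# Run the same suite against two agent variants (e.g. the current
# production model versus a candidate model), then compare the results.
baseline = mean_score("results_baseline.jsonl")
candidate = mean_score("results_candidate.jsonl")

print(f"baseline:  {baseline:.2%}")
print(f"candidate: {candidate:.2%}")
print("adopt the candidate" if candidate >= baseline else "keep the baseline")
```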
Customer Highlight: Bilt Rewards
We're already seeing customers use Letta Evals to ensure production-grade agent deployments. Bilt runs over a million personalized stateful agents for neighborhood commerce recommendations, and uses the Letta Evals framework to maintain reliability as they rapidly iterate on their agent architectures. With millions of agents in production serving personalized recommendations to users, Bilt needs confidence that changes improve performance without breaking existing functionality. By integrating evals into their development workflow, they can validate agent behavior across their test suite before deploying updates, whether they're testing new models, refining prompts, or adjusting their multi-agent architecture.
Letta Leaderboard
Internally at Letta, we are also using the Letta Evals package to evaluate models on the Letta Leaderboard, a set of standardized evaluations for comparing different LLMs on core memory and filesystem capabilities in Letta. These benchmarks help you choose the right model for your use case and track improvements as models evolve. All leaderboard evaluations are available as examples in the Letta Evals package.
Conclusion
As agents become more sophisticated and take on increasingly important tasks where errors can be critical, systematic evaluation of long-running agent behavior becomes essential. Letta Evals provides a rigorous, reproducible way to test agents, catch regressions in agent behavior, and make informed decisions about how to improve your agents, whether through model and tool selection or architectural changes.
To try Letta Evals, you can check out:
- Letta Evals OSS repository: https://github.com/letta-ai/letta-evals
- Letta Evals documentation: https://docs.letta.com/evals
- Letta Platform: https://app.letta.com