Introducing Recovery-Bench: Evaluating LLMs' Ability to Recover from Mistakes

"The only real mistake is the one from which we learn nothing." - Henry Ford
Long-lived agents will eventually make mistakes and need to recover. For example, a coding agent might implement the wrong solution, or a personal assistant may misunderstand instructions and execute an incorrect sequence of actions. In these cases, it's crucial that agents can recover (or ideally learn) from their mistakes, rather than suffer permanent performance degradation due to context pollution.
Today we're excited to announce Recovery-Bench, a benchmark and evaluation method for measuring how well agents can recover from errors and corrupted states. We find that the models that perform best over long horizons differ from the top performers in fresh states, with GPT-5 showing an especially large jump in ranking.

Background
Current agents excel at short-horizon tasks like small bug fixes, but their ability to independently execute complex long-horizon tasks remains limited. On long-horizon tasks, even the most capable frontier models will inevitably make errors at some point in their trajectory. Yet existing agent benchmarks evaluate agents starting from fresh environments: coding agents, for example, are usually evaluated from a clean repository and a GitHub issue. In reality, coding agents work in messy environments containing files and edits from previous attempts at the task, erroneous reasoning traces from debugging sessions, and other artifacts.
Recovery-Bench is a new benchmark that bridges this gap by evaluating agents in more realistic environments that include existing trajectories from previous attempts. Failed trajectories represent opportunities for agents to learn from their mistakes, but our research reveals a counterintuitive finding: the best AI models today are often distracted or misled by failed trajectories in their context, suffering from context pollution.
Context pollution refers to the leftover artifacts of an agent’s failure. It can include erroneous actions, misleading reasoning traces, or corrupted environment states that persist into an agent’s working context. For instance, a coding agent might inherit a broken function or misapplied patch from an earlier attempt, leading it to compound errors rather than correct them.
Our results highlight that recovery is a distinct capability from performance in fresh states. Claude 4 Sonnet, which tops standard evaluations, drops in rank under recovery conditions, while GPT-5 moves ahead. This shows that resilience to context pollution does not simply track raw problem-solving strength.
Building Recovery-Bench
Recovery-Bench provides a simple, general methodology for measuring agent recovery: how well an agent can complete a task by recovering from a previous failed attempt.
We select a long-horizon task and ask a weaker agent to solve it in the standard setting with a clean environment. We then filter the resulting trajectories (the full sequences of actions an agent takes and the environment states those actions produce), keeping only those where the agent failed to complete the task. Finally, we evaluate how well other agents can recover by initializing them with the actions and environment left behind by these failed trajectories.
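As a rough sketch of this recipe, the Python below shows how failed trajectories might be collected and then handed to a stronger agent. The Trajectory fields and the agent callables are hypothetical simplifications for illustration, not the actual Recovery-Bench implementation (see the GitHub repository linked below for the real code).

```python
# Hypothetical sketch of the Recovery-Bench construction recipe.
# Data structures and callables are illustrative, not the real harness.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trajectory:
    task_id: str
    actions: list[str]    # commands the weak agent issued
    env_snapshot: bytes   # e.g., a saved filesystem or container image
    succeeded: bool


def build_recovery_tasks(
    task_ids: list[str],
    run_weak_agent: Callable[[str], Trajectory],
) -> list[Trajectory]:
    """Run a weaker agent on every task, keeping only its failed attempts."""
    trajectories = [run_weak_agent(task_id) for task_id in task_ids]
    return [t for t in trajectories if not t.succeeded]


def evaluate_recovery(
    failed: list[Trajectory],
    run_strong_agent: Callable[[Trajectory], bool],
) -> float:
    """Start a stronger agent from each failed state and report its success rate."""
    if not failed:
        return 0.0
    return sum(run_strong_agent(t) for t in failed) / len(failed)
```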
Recovery-Bench-v0 uses tasks from Terminal-Bench, a challenging benchmark for measuring agents' capabilities at using the terminal. We first collect trajectories from gpt-4o-mini, then evaluate a variety of models (GPT-5, Gemini 2.5 Pro, Claude 4 Sonnet, o4-mini, GPT-4.1) on agent recovery tasks.
How Challenging Is Recovery?
Agent recovery tasks are challenging even for the strongest models. In the original Terminal-Bench setting, models scored an average of 26.3%, with the best model, Claude 4 Sonnet, reaching 34.8%. On Recovery-Bench, models score an average of only 11.2%. Compared to their performance on the original Terminal-Bench, they show a 57% relative decrease in accuracy.
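(For reference, that relative drop is the decline in average accuracy normalized by the original average: (26.3% - 11.2%) / 26.3% ≈ 57%.)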

Is Agent Recovery a Standard Capability?
We show that the rankings on Recovery-Bench differ from those in the standard Terminal-Bench setting, indicating that recovery represents an orthogonal capability. For example, Claude 4 Sonnet achieves the highest Terminal-Bench score at 34.8% but ranks third on Recovery-Bench, while GPT-5 achieves only 20.2% on the original Terminal-Bench but ranks first on Recovery-Bench. Interestingly, o4-mini has the lowest Terminal-Bench score but performs significantly better than GPT-4.1 on agent recovery.
What's in a Recovery State?
Recovery states contain two distinct components: environments and action histories. In Terminal-Bench, actions correspond to commands sent to the terminal (e.g., ls) and the resulting terminal states. In realistic settings, agents need to recover from both the modified environment and erroneous actions present in their context.
To better understand what makes Recovery-Bench challenging, we study three settings that initialize the recovery agent with different components, each sketched in code after the list:
- Environment: The environment is set to the state after a failed agent trajectory, but the new agent receives no additional context about the initial trajectory
- Environment + Action Summary: In addition to the environment, the agent receives a summary of the previous actions
- Environment + Action History: In addition to the environment, the agent receives the full actions from the initial failed trajectory
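To make these settings concrete, here is a minimal Python sketch of how each one might assemble the recovering agent's starting context. The function names and prompt wording are illustrative assumptions rather than the benchmark's actual interface.

```python
# Hypothetical sketch of how each setting could assemble the recovering
# agent's starting context (illustrative only; not the real harness).

def build_initial_context(task_prompt: str, actions: list[str], setting: str) -> str:
    """Return the recovering agent's starting prompt for a given setting.

    In all three settings the environment itself (files, processes, etc.) is
    restored to its post-failure state; the settings differ only in how much
    of the failed trajectory is shown to the agent.
    """
    if setting == "environment":
        # No information about the previous attempt beyond the environment itself.
        return task_prompt
    if setting == "environment+summary":
        # A short natural-language summary of the previous attempt's actions.
        return f"{task_prompt}\n\nSummary of a previous failed attempt:\n{summarize_actions(actions)}"
    if setting == "environment+history":
        # Every command from the failed trajectory, verbatim.
        return f"{task_prompt}\n\nFull command history of a previous failed attempt:\n" + "\n".join(actions)
    raise ValueError(f"unknown setting: {setting}")


def summarize_actions(actions: list[str]) -> str:
    """Placeholder summarizer; a real harness might use an LLM for this step."""
    if not actions:
        return "The previous agent took no actions."
    return f"The previous agent ran {len(actions)} commands, ending with: {actions[-1]}"
```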

In our benchmark results, providing just the environment already significantly degrades agent performance compared to the original setting. Giving the agent the environment plus an action summary degrades performance further. Interestingly, providing the full action history yields the worst performance, indicating that exposing the raw actions does not help the agent resolve errors in the environment and may itself be a source of distraction. Taken together, these results show that Recovery-Bench's difficulty stems both from recovering from erroneous actions that have altered the environment and from managing context pollution caused by errors in the action history.
Future Work
Our initial results make it clear that today's frontier models lack the ability to recover naturally from failed states, a key ingredient for continual learning. Our research also suggests that agents may benefit from using different models at different points within a single trajectory, as well as from robust context management systems (like those in Letta) that enable an agent to learn from failed trajectories rather than be distracted by them.
We also plan to extend the Recovery-Bench methodology to tasks beyond Terminal-Bench, such as SWE-Bench and other agent benchmarks, to understand agent recovery more broadly. We expect to observe different phenomena in domains beyond terminal use.
At Letta, we are building self-improving, perpetual agents that can learn from experience and adapt over time. If you’re excited by our mission and want to contribute towards work like Recovery-Bench, apply to join our team.
Learn More
- GitHub repository: https://github.com/letta-ai/recovery-bench
- Building stateful agents with Letta: https://docs.letta.com
- Try Letta Cloud: https://app.letta.com