Research

Building the #1 open source terminal-use agent using Letta

August 5, 2025

Our Results: 42.5% on Terminal-Bench (#1 open-source agent)

We built the #1 open-source agent for terminal use, achieving a 42.5% overall score on Terminal-Bench, which ranks 4th overall and 2nd among agents using Claude 4 Sonnet. Our agent is implemented in under 200 lines of code using Letta’s stateful agents SDK.

Our results using Claude 4 Sonnet roughly match Claude Code using 4 Opus, a much larger and more expensive model. This result places our agent in an elite category on one of the most challenging benchmarks for AI agents, one where even the best proprietary models like Gemini 2.5 Pro, GPT 4.1, and o3 struggle to get above 30%.

This achievement demonstrates a critical point: agents with effective context management can achieve significant gains in long-running tasks. Letta makes it easy to build specialized agents with minimal scaffolding, because memory and state are managed for you.

What is Terminal-Bench?

Terminal-Bench is a benchmark that evaluates AI agents on real-world command-line work. It consists of more than 100 challenging tasks that test an agent's capabilities in terminal environments. What makes Terminal-Bench particularly valuable is its focus on real-world complexity, with tasks such as:

Compiling code repositories and building Linux kernels from source
Training machine learning models
Setting up and configuring servers
Debugging system configurations

Each task runs in its own Docker container and resembles the terminal work that engineers and scientists deal with every day.

Building a terminal-use agent with Letta

Letta provides a stateful agents API layer compatible with any model (OpenAI, Anthropic, etc.). It also provides tools for managing the context window (or agent memory) over time, such as rewriting segments of the context window (referred to as memory blocks), compacting context, or storing and retrieving external memories.

Our terminal-use agent uses Letta’s built-in capabilities for context management and memory, specifically memory blocks.

The terminal-use agent has two memory blocks:

A read-only “task description” block
A read-write “todo list” block used for planning

The agent is able to modify the todo list over time using the memory_replace and memory_insert tools provided by Letta.
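The two-block layout can be illustrated with a small toy model. This is a sketch only and does not use the Letta SDK: the `MemoryBlock` and `AgentMemory` classes, the block labels, and the argument lists of `memory_replace` and `memory_insert` are assumptions made for illustration. Only the tool names and the read-only/read-write split come from the description above.

```python
from dataclasses import dataclass


@dataclass
class MemoryBlock:
    """Toy stand-in for a memory block: a labeled string kept in the context window."""
    label: str
    value: str
    read_only: bool = False


class AgentMemory:
    """Hypothetical container that enforces read-only blocks."""

    def __init__(self, blocks):
        self.blocks = {b.label: b for b in blocks}

    def _writable(self, label):
        block = self.blocks[label]
        if block.read_only:
            raise PermissionError(f"block '{label}' is read-only")
        return block

    def memory_replace(self, label, old, new):
        """Replace the first exact occurrence of `old` in the block (assumed signature)."""
        block = self._writable(label)
        if old not in block.value:
            raise ValueError(f"text not found in block '{label}'")
        block.value = block.value.replace(old, new, 1)

    def memory_insert(self, label, text, line=-1):
        """Insert `text` as a new line; a negative index appends (assumed signature)."""
        block = self._writable(label)
        lines = block.value.splitlines()
        if line < 0:
            lines.append(text)
        else:
            lines.insert(line, text)
        block.value = "\n".join(lines)


memory = AgentMemory([
    MemoryBlock("task_description", "Build the project and run its tests.", read_only=True),
    MemoryBlock("todo_list", "[ ] inspect repository"),
])
memory.memory_insert("todo_list", "[ ] run build")
memory.memory_replace("todo_list", "[ ] inspect repository", "[x] inspect repository")
```

Keeping the task description read-only means the agent cannot accidentally rewrite its own goal, while the todo list stays freely editable as the plan evolves.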

The agent is also given additional tools specifically for Terminal-Bench:

send_keys to execute terminal commands
task_completed to signal task completion
quit_process to interrupt the currently running process
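As a rough illustration of how these three tools might be wired to a terminal session, the sketch below routes tool calls to an in-memory stand-in. The `FakeTerminal` class, the `dispatch` helper, the `keys` argument, and the shape of the tool-call dictionary are all hypothetical; the actual tool signatures live in the benchmark repository.

```python
class FakeTerminal:
    """In-memory stand-in for the task's terminal session (hypothetical)."""

    def __init__(self):
        self.screen = []      # lines currently visible on the terminal
        self.running = None   # command assumed to be in the foreground
        self.done = False     # set once the agent declares completion

    def send_keys(self, keys: str) -> str:
        """Type a command and return the visible terminal contents."""
        self.running = keys
        self.screen.append(f"$ {keys}")
        return "\n".join(self.screen)

    def quit_process(self) -> str:
        """Interrupt the foreground process, much as Ctrl-C would."""
        interrupted = self.running
        self.running = None
        self.screen.append("^C")
        return f"interrupted: {interrupted}"

    def task_completed(self) -> str:
        """Signal that the agent considers the task finished."""
        self.done = True
        return "task marked complete"


def dispatch(terminal, tool_call):
    """Route a tool call of the form {'name': ..., 'args': {...}} to the terminal."""
    tools = {
        "send_keys": terminal.send_keys,
        "quit_process": terminal.quit_process,
        "task_completed": terminal.task_completed,
    }
    name = tool_call["name"]
    if name not in tools:
        return f"unknown tool: {name}"
    return tools[name](**tool_call.get("args", {}))
```

A small, fixed tool surface like this keeps the agent's action space easy to reason about: every step is either a keystroke, an interrupt, or a declaration that the task is done.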

Visualizing the Letta Terminal-Bench agent's control flow inside the ADE

To solve a terminal task, we instantiate a Letta terminal-use agent and grant it access to the terminal environment. The agent observes the current terminal state, updates its internal todo list as necessary, and generates the next command to execute based on its plan. This cycle of observing the environment, updating memory, and executing actions repeats until the agent calls the task_completed tool. When the context window grows beyond a set threshold (40k tokens), Letta performs recursive summarization (i.e., compaction) of previous messages. The agent’s ability to manage its memory (the message history and memory blocks) allows it to avoid common pitfalls like derailment and distraction while solving long-running, complex tasks.
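The observe, decide, execute cycle with threshold-triggered compaction can be sketched as follows. This is a simplified illustration, not Letta's implementation: `run_agent`, `compact`, and `count_tokens` are hypothetical helpers, the 4-characters-per-token estimate is a crude assumption, and the summary is a placeholder string where a real system would ask a model to summarize. Only the 40k-token threshold and the stop-on-task_completed rule come from the text.

```python
TOKEN_LIMIT = 40_000  # compaction threshold mentioned in the text


def count_tokens(messages):
    """Crude estimate (assumption): about 4 characters per token."""
    return sum(len(m) for m in messages) // 4


def compact(messages, keep_recent=2):
    """Fold older messages into a single summary entry and keep the newest ones.

    An earlier summary is itself an old message, so repeated calls
    summarize summaries, which is the recursive part.
    """
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [f"[summary of {len(old)} earlier messages]"] + recent


def run_agent(policy, observe, execute, max_steps=50, token_limit=TOKEN_LIMIT):
    """Loop until the policy returns 'task_completed' or max_steps is reached."""
    messages = []
    for step in range(max_steps):
        messages.append(observe())        # read the current terminal state
        action = policy(messages)         # the model picks the next tool call
        if action == "task_completed":
            return step + 1, messages
        messages.append(execute(action))  # run the command, record its output
        if count_tokens(messages) > token_limit:
            messages = compact(messages)
    return max_steps, messages


# Scripted policy standing in for the model; a tiny limit forces one compaction.
commands = iter(["ls", "make", "task_completed"])
steps, history = run_agent(
    policy=lambda msgs: next(commands),
    observe=lambda: "screen: $",
    execute=lambda cmd: f"ran {cmd}",
    token_limit=5,
)
```

In this sketch the memory blocks are left out for brevity; in the agent described above they sit outside the message history, so the task description and todo list survive compaction unchanged.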

With Letta, agent developers can rapidly specialize agents for specific tasks by focusing on building the right prompts, tools, and environment. Building the #1 open-source terminal-use agent with Letta shows that Letta's general-purpose memory management provides effective building blocks for more capable agents, well beyond long-running chatbots.

Learn more

See our results on Terminal-Bench: https://www.tbench.ai/leaderboard
Take a look at our benchmark repository: https://github.com/letta-ai/letta-terminalbench
Learn more about building with Letta: https://docs.letta.com
