
Skill Learning: Bringing Continual Learning to CLI Agents

December 2, 2025

The biggest gap in AI agents today is their inability to learn and improve over time, particularly from previous experiences such as mistakes or human feedback. We showed with Recovery-Bench that an agent’s performance actually degrades when prior errors sit in its context. What if agents could instead actually improve from prior experiences (i.e., trajectories)?

To do this, agents must be able to adapt their behavior through memory and learning. Memory for agents comes in different forms. In-context “core memory” (carried as part of the system prompt) is updated through tools, as introduced in MemGPT. Skills, a more recent release from Anthropic, productize the idea of giving agents reference “handbooks” or “cheatsheets”: they let agents dynamically load files that contain information about specific tasks (e.g. generating a PDF file), but they are generally static.

Today we’re releasing Skill Learning, a way for agents to dynamically learn skills over time. With Skill Learning, agents can use their past experience to actually improve, rather than degrade. We implement Skill Learning in Letta Code, a research preview of our model-agnostic agent harness built on the Letta API. On Terminal Bench 2.0, we find that learned skills can boost performance by 36.8% relative (15.7% absolute), unlocking new capabilities for agents.

Skills in Letta Code open new opportunities for continual learning because the harness is model-agnostic and persistent. For instance, Letta Code agents can use different models even within the same session, allowing stronger models to create skills that weaker models can leverage. Moreover, a Letta Code agent reviewing PRs could reuse skills that the same agent generated in the command line.

How Does Skill Learning Work?

We power Skill Learning with a two-stage learning process (sketched in code after the list):

  1. Reflection: Given an agent’s trajectory for a task, we generate a reflection that evaluates whether the agent solved the task, assesses the logical soundness of its reasoning, verifies that all steps are justified and edge cases are handled, and identifies any repetitive patterns or actions that could be abstracted.
  2. Creation: We feed this reflection to our learning agent (powered by Letta Code), which uses Anthropic’s skill-creator to generate a skill that provides potential approaches, common pitfalls, and verification strategies.
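As a rough sketch, the two stages can be thought of as two LLM calls chained together. The prompts and function names below are illustrative, not the actual implementation (which lives in Letta Code and uses Anthropic’s skill-creator as an agent rather than a single call):

```python
from pathlib import Path

# Illustrative reflection prompt covering the checks described above.
REFLECTION_PROMPT = """You are reviewing an agent's trajectory for a task.
1. Did the agent solve the task?
2. Was its reasoning logically sound?
3. Is every step justified, and are edge cases handled?
4. Which repetitive patterns or actions could be abstracted into a reusable skill?

Task: {task}
Trajectory: {trajectory}
"""

# Illustrative creation prompt for turning a reflection into a skill file.
CREATION_PROMPT = """Using the reflection below, write a markdown skill that
captures potential approaches, common pitfalls, and verification strategies
for this class of task.

Reflection: {reflection}
"""

def learn_skill(task: str, trajectory: str, llm, skills_dir: Path) -> Path:
    # Stage 1 (Reflection): evaluate the trajectory and surface reusable patterns.
    reflection = llm(REFLECTION_PROMPT.format(task=task, trajectory=trajectory))
    # Stage 2 (Creation): persist the learnings as a skill file on disk.
    skill_md = llm(CREATION_PROMPT.format(reflection=reflection))
    skill_path = skills_dir / f"{task}.md"
    skill_path.write_text(skill_md)
    return skill_path
```

Here `llm` is any text-completion callable; the important structural point is that the reflection is produced first and the skill is generated from the reflection, not from the raw trajectory.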

While we use a relatively simple reflection step here, we can scale sleep-time compute during reflection to deepen the analysis of the past trajectory and improve the quality of learning.  

Experiments

We evaluate whether Skill Learning can actually improve agent capabilities using Terminal Bench 2.0, a collection of real-world tasks set in command-line environments. We use Terminal Bench as our testbed since its tasks vary in complexity and in the domain knowledge they require. For instance, the feal-differential-cryptanalysis task requires “deep domain knowledge and non-trivial algorithmic implementation”.

We start by evaluating Letta Code on Terminal Bench 2.0 to collect trajectories and textual feedback (logs from the verifier tests that check task completion). We then use Skill Learning with varying levels of context to generate skills. Finally, we evaluate the agent in three settings:

  1. Baseline: The agent doesn’t have access to any skills. Trajectories and feedback from this setting are used for skill learning.
  2. Skills (Trajectory): The agent has access to skills learned using the baseline agent’s trajectory only.
  3. Skills (Trajectory + Feedback): The agent has access to skills learned using the baseline agent’s trajectory and textual feedback from the verifier (see the sketch after this list).
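The two skill-learning settings differ only in what the reflection stage gets to see. A minimal sketch of how that learning context might be assembled (a hypothetical helper, continuing the sketch above; the real context assembly in Letta Code may differ):

```python
def build_learning_context(trajectory: str, feedback: str | None = None) -> str:
    """Assemble the reflection input for the two skill-learning settings."""
    if feedback is None:
        # Skills (Trajectory): the reflection sees the trajectory alone.
        return trajectory
    # Skills (Trajectory + Feedback): appending the verifier logs lets the
    # reflection also encode why failed attempts failed.
    return f"{trajectory}\n\nVerifier feedback:\n{feedback}"
```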

For all runs, the Letta Code agent is configured with Sonnet 4.5 (with extended thinking) on all 89 tasks in Terminal Bench 2.0, and we report the aggregated score.

Results: Can Skills Be Learned?

Our results show that skills can be effectively learned: access to learned skills yields a 21.1% relative (9% absolute) improvement on Terminal Bench over our baseline agent (which matches Terminus 2 performance), while also reducing costs by 15.7% and tool calls by 10.4%.

To further study skill learning, we enrich the reflection stage with textual feedback that provides context on errors for failed tasks. Skill learning with feedback provided a further 6.7% absolute improvement, leading to an overall gain of 36.8% relative (15.7% absolute) over the baseline. While skills learned from trajectories alone can capture successful patterns, feedback-informed skills better encode common failure modes and unsuccessful approaches, making them more informative and robust. This suggests that providing richer context, especially textual feedback, can lead to more effective skill learning.

Skills provide domain knowledge and structured approaches that guide task execution and reduce common failure modes like context poisoning and missed edge cases. Ultimately, this enhances agent capabilities and allows them to complete tasks with fewer tool calls and lower overall cost.

How Do Skills Help?

The build-cython-ext task, which requires compiling Cython extensions, has a success rate of roughly 0% across frontier models. Without skills, the agent built first and fixed errors as they appeared, a reactive approach that missed critical np.int references hidden in Cython files. With skills, the agent searched for all deprecated patterns before building, caught import alias variations (n. vs np.), and used context-aware replacements to successfully complete the task.
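A minimal sketch of that proactive pass (the alias and pattern lists here are illustrative, not taken from the learned skill itself):

```python
import re
from pathlib import Path

# Deprecated NumPy scalar aliases (removed in NumPy 1.24); illustrative subset.
DEPRECATED = ["int", "float", "bool", "object"]
# The numpy module may be bound to several names across Cython files.
ALIASES = ["np", "numpy", "n"]

# The trailing \b keeps valid names like np.int32 from matching.
PATTERN = re.compile(rf"\b({'|'.join(ALIASES)})\.({'|'.join(DEPRECATED)})\b")

def find_deprecated_uses(root: Path):
    """Scan every Cython source for deprecated usages *before* building."""
    for path in root.rglob("*.pyx"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            for match in PATTERN.finditer(line):
                yield path, lineno, match.group(0)

# Each hit still needs review: the right replacement (e.g. np.int -> int
# or np.int64) depends on how the value is used.
for path, lineno, use in find_deprecated_uses(Path(".")):
    print(f"{path}:{lineno}: deprecated {use}")
```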

In general, many of the tasks given to an agent are similar enough in nature to benefit from shared information. Rather than treating every trajectory as isolated, skill learning allows agents to benefit from what they’ve learned in previous trajectories.

What this means for agent memory & continual learning

We're excited by the research directions that Skill Learning enables for continual learning. Our results show that learning from trajectories alone improves model performance, suggesting more scalable paths to self-improvement beyond traditional human feedback. Agents could go beyond generating trajectories and skills to generating, and learning from, their own tests and even entire tasks. However, since learning from human-written verifiable rewards outperforms trajectory-only learning, another promising direction is building agents that are calibrated to their own capabilities: agents that know when to seek external feedback versus when to self-improve autonomously.

Skill Learning enables agents to learn and improve over time without needing to change their underlying weights. Since skills are stored as .md files that are modular and can be managed with git, they are a convenient way to share learnings across your organization. For Letta agents, which support both skills and core memory (in-context memory blocks), memory can be organized in a hierarchy (sketched after the list):

  • Core Memory / System Prompt Learning: An evolving, learned system prompt that applies across tasks and is generally specific to that agent’s state.
  • Skills / Filesystem: Evolving files used for task-specific memory, designed to be interchangeable between agents.
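As a plain-Python sketch of this hierarchy (the class and field names are hypothetical; in Letta, core memory is managed as memory blocks and skills live on the agent’s filesystem):

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class AgentMemory:
    # Core memory: in-context blocks compiled into the system prompt.
    # Evolves across tasks and is specific to this agent's state.
    core_blocks: dict[str, str] = field(default_factory=dict)

    # Skills: markdown files on disk, loaded on demand for specific tasks,
    # interchangeable between agents, and manageable with git.
    skills_dir: Path = Path("skills")

    def system_prompt(self) -> str:
        """Compile core memory into the always-visible system prompt."""
        return "\n\n".join(
            f"<{label}>\n{value}\n</{label}>"
            for label, value in self.core_blocks.items()
        )

    def load_skill(self, name: str) -> str:
        """Pull a skill file into context only when the task calls for it."""
        return (self.skills_dir / f"{name}.md").read_text()
```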

By combining core memory and skills, agents can evolve over time.

Using Skill Learning 

You can use Skill Learning in the Letta Code harness. After interacting with your agent, simply call the /skill command to enter skill learning mode.
