
Skill Learning: Bringing Continual Learning to CLI Agents

December 2, 2025

The biggest gap in AI agents today is their inability to learn and improve over time, particularly from previous experiences such as mistakes or human feedback. We showed with Recovery-Bench that an agent’s performance actually degrades when prior errors sit in its context. What if agents could instead actually improve from prior experiences (i.e., trajectories)?

To do this, agents must be able to adapt their behavior through memory and learning. Memory for agents comes in different forms. In-context “core memory” (carried as part of the system prompt) is updated through tools, as introduced in MemGPT. Skills, a more recent release from Anthropic, productize the idea of giving agents reference “handbooks” or “cheatsheets”: they let agents dynamically load files that contain information about specific tasks (e.g. generating a PDF file), but they are generally static.

Today we’re releasing Skill Learning, a way for agents to dynamically learn skills over time. With Skill Learning, agents can use their past experience to actually improve, rather than degrade. We implement Skill Learning in Letta Code, a research preview of our model-agnostic agent harness built on the Letta API. On Terminal Bench 2.0, we find that learned skills can boost performance by 36.8% relative (15.7% absolute), unlocking new capabilities for agents.

Skills in Letta Code open new opportunities for continual learning because the harness is model-agnostic and persistent. For instance, Letta Code agents can use different models even within the same session, allowing stronger models to create skills that weaker models can leverage. Moreover, a Letta Code agent reviewing PRs could reuse skills that the same agent generated in the command line.

How Does Skill Learning Work?

We power Skill Learning with a two-stage learning process (sketched in code after the list):

  1. Reflection: Given an agent’s trajectory for a task, we generate a reflection that evaluates whether the agent solved the task, assesses the logical soundness of its reasoning, verifies that all steps are justified and edge cases are handled, and identifies any repetitive patterns or actions that could be abstracted.
  2. Creation: We feed this reflection to our learning agent (powered by Letta Code), which uses Anthropic’s skill-creator to generate a skill that provides potential approaches, common pitfalls, and verification strategies.
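As a rough sketch, the two stages can be thought of as two LLM calls chained together. The prompts and function names below are illustrative, not the actual implementation (which lives in Letta Code and uses Anthropic’s skill-creator as an agent rather than a single call):

```python
from pathlib import Path

# Illustrative reflection prompt covering the checks described above.
REFLECTION_PROMPT = """You are reviewing an agent's trajectory for a task.
1. Did the agent solve the task?
2. Was its reasoning logically sound?
3. Is every step justified, and are edge cases handled?
4. Which repetitive patterns or actions could be abstracted into a reusable skill?

Task: {task}
Trajectory: {trajectory}
"""

# Illustrative creation prompt for turning a reflection into a skill file.
CREATION_PROMPT = """Using the reflection below, write a markdown skill that
captures potential approaches, common pitfalls, and verification strategies
for this class of task.

Reflection: {reflection}
"""

def learn_skill(task: str, trajectory: str, llm, skills_dir: Path) -> Path:
    # Stage 1 (Reflection): evaluate the trajectory and surface reusable patterns.
    reflection = llm(REFLECTION_PROMPT.format(task=task, trajectory=trajectory))
    # Stage 2 (Creation): persist the learnings as a skill file on disk.
    skill_md = llm(CREATION_PROMPT.format(reflection=reflection))
    skill_path = skills_dir / f"{task}.md"
    skill_path.write_text(skill_md)
    return skill_path
```

Here `llm` is any text-completion callable; the important structural point is that the reflection is produced first and the skill is generated from the reflection, not from the raw trajectory.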

While we use a relatively simple reflection step here, we can scale sleep-time compute during reflection to deepen the analysis of the past trajectory and improve the quality of learning.  

Experiments

We evaluate whether Skill Learning can actually improve agent capabilities using Terminal Bench 2.0, a collection of real-world tasks set in command-line environments. We use Terminal Bench as our testbed since its tasks vary in complexity and in the domain knowledge they require. For instance, the feal-differential-cryptanalysis task requires “deep domain knowledge and non-trivial algorithmic implementation”.

We start by evaluating Letta Code on Terminal Bench 2.0 to collect trajectories and textual feedback (logs from the verifier tests that check task completion). We then use Skill Learning with varying levels of context to generate skills. Finally, we evaluate the agent in three settings:

  1. Baseline: The agent doesn’t have access to any skills. Trajectories and feedback from this setting are used for skill learning.
  2. Skills (Trajectory): The agent has access to skills learned using the baseline agent’s trajectory only.
  3. Skills (Trajectory + Feedback): The agent has access to skills learned using the baseline agent’s trajectory and textual feedback from the verifier (see the sketch after this list).
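The two skill-learning settings differ only in what the reflection stage gets to see. A minimal sketch of how that learning context might be assembled (a hypothetical helper, continuing the sketch above; the real context assembly in Letta Code may differ):

```python
def build_learning_context(trajectory: str, feedback: str | None = None) -> str:
    """Assemble the reflection input for the two skill-learning settings."""
    if feedback is None:
        # Skills (Trajectory): the reflection sees the trajectory alone.
        return trajectory
    # Skills (Trajectory + Feedback): appending the verifier logs lets the
    # reflection also encode why failed attempts failed.
    return f"{trajectory}\n\nVerifier feedback:\n{feedback}"
```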

For all runs, the Letta Code agent is configured with Sonnet 4.5 (with extended thinking) on all 89 tasks in Terminal Bench 2.0, and we report the aggregated score.

Results: Can Skills Be Learned?

Our results show that skills can be effectively learned: access to learned skills yields a 21.1% relative (9% absolute) improvement on Terminal Bench over our baseline agent (which matches Terminus 2 performance), while also reducing costs by 15.7% and tool calls by 10.4%.

To further study skill learning, we enrich the reflection stage with textual feedback that provides context on errors for failed tasks. Skill learning with feedback provided a further 6.7% absolute improvement, leading to an overall gain of 36.8% relative (15.7% absolute) over the baseline. While skills learned from trajectories alone can capture successful patterns, feedback-informed skills better encode common failure modes and unsuccessful approaches, making them more informative and robust. This suggests that providing richer context, especially textual feedback, can lead to more effective skill learning.

Skills provide domain knowledge and structured approaches that guide task execution and reduce common failure modes like context poisoning and missed edge cases. Ultimately, this enhances agent capabilities and allows them to complete tasks with fewer tool calls and lower overall cost.

How Do Skills Help?

The build-cython-ext task, which requires compiling Cython extensions, has a success rate of roughly 0% across frontier models. Without skills, the agent built first and fixed errors as they appeared, a reactive approach that missed critical np.int references hidden in Cython files. With skills, the agent searched for all deprecated patterns before building, caught import alias variations (n. vs np.), and used context-aware replacements to successfully complete the task.
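A minimal sketch of that proactive pass (the alias and pattern lists here are illustrative, not taken from the learned skill itself):

```python
import re
from pathlib import Path

# Deprecated NumPy scalar aliases (removed in NumPy 1.24); illustrative subset.
DEPRECATED = ["int", "float", "bool", "object"]
# The numpy module may be bound to several names across Cython files.
ALIASES = ["np", "numpy", "n"]

# The trailing \b keeps valid names like np.int32 from matching.
PATTERN = re.compile(rf"\b({'|'.join(ALIASES)})\.({'|'.join(DEPRECATED)})\b")

def find_deprecated_uses(root: Path):
    """Scan every Cython source for deprecated usages *before* building."""
    for path in root.rglob("*.pyx"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            for match in PATTERN.finditer(line):
                yield path, lineno, match.group(0)

# Each hit still needs review: the right replacement (e.g. np.int -> int
# or np.int64) depends on how the value is used.
for path, lineno, use in find_deprecated_uses(Path(".")):
    print(f"{path}:{lineno}: deprecated {use}")
```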

In general, many of the tasks given to an agent are similar enough in nature to benefit from shared information. Rather than treating every trajectory as isolated, skill learning allows agents to benefit from what they’ve learned in previous trajectories.

What this means for agent memory & continual learning

We're excited by the research directions that Skill Learning enables for continual learning. Our results show that learning from trajectories alone improves model performance, suggesting more scalable paths to self-improvement beyond traditional human feedback. Agents could go beyond generating trajectories and skills to generating, and learning from, their own tests and even entire tasks. However, since learning from human-written verifiable rewards outperforms trajectory-only learning, another promising direction is building agents that are calibrated to their own capabilities: agents that know when to seek external feedback versus when to self-improve autonomously.

Skill Learning enables agents to learn and improve over time without needing to change their underlying weights. Since skills are stored as .md files that are modular and can be managed with git, they are a convenient way to share learnings across your organization. For Letta agents, which support both skills and core memory (in-context memory blocks), memory can be organized in a hierarchy (sketched after the list):

  • Core Memory / System Prompt Learning: An evolving, learned system prompt that applies across tasks and is generally specific to that agent’s state.
  • Skills / Filesystem: Evolving files used for task-specific memory, designed to be interchangeable between agents.
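As a plain-Python sketch of this hierarchy (the class and field names are hypothetical; in Letta, core memory is managed as memory blocks and skills live on the agent’s filesystem):

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class AgentMemory:
    # Core memory: in-context blocks compiled into the system prompt.
    # Evolves across tasks and is specific to this agent's state.
    core_blocks: dict[str, str] = field(default_factory=dict)

    # Skills: markdown files on disk, loaded on demand for specific tasks,
    # interchangeable between agents, and manageable with git.
    skills_dir: Path = Path("skills")

    def system_prompt(self) -> str:
        """Compile core memory into the always-visible system prompt."""
        return "\n\n".join(
            f"<{label}>\n{value}\n</{label}>"
            for label, value in self.core_blocks.items()
        )

    def load_skill(self, name: str) -> str:
        """Pull a skill file into context only when the task calls for it."""
        return (self.skills_dir / f"{name}.md").read_text()
```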

By combining core memory and skills, agents can evolve over time.

Using Skill Learning 

You can use Skill Learning in the Letta Code harness. After interacting with your agent, simply call the /skill command to enter skill learning mode.
