
Context-Bench: Benchmarking LLMs on Agentic Context Engineering

October 30, 2025

To see the full benchmark results, check the live leaderboard.

Modern AI agents have become increasingly adept at accessing files and tools to retrieve information — from searching the web and MCP servers, to editing code with Bash and Unix tools, to more advanced use cases such as editing memories and loading "skills". A critical challenge is determining what information should be in the agent's context window at any given time: too much information can cause context rot, while too little can cause hallucinations and poor performance. "Agentic context engineering" is when agents themselves (rather than humans) strategically decide what context to retrieve and load to accomplish a task.

Agentic context engineering is the new frontier in AI agents. Models post-trained specifically for context engineering will excel at long-horizon tasks whose length far exceeds the native context windows of the LLMs themselves. So which models do it best?

To answer this question, we are open-sourcing Context-Bench, which evaluates how well language models can chain file operations, trace entity relationships, and manage multi-step information retrieval in long-horizon tasks.

Context-Bench shows promise for the open-source community: the gap between frontier open-weight and closed-weight models appears to be closing. Unsurprisingly, frontier closed models that have been explicitly trained for context engineering, such as Sonnet 4.5, still top the current leaderboard.

Introducing Context-Bench

Context-Bench measures an agent's ability to perform context engineering. Our goal is to construct a benchmark with the following properties:

  1. Contamination Proof. Questions are generated from a SQL database with verified ground-truth answers, and all entity names and relationships are completely fictional to ensure models cannot rely on training data.
  2. Multi-Hop / Multi-Turn Tool Calling. Designed to require multiple tool calls and strategic information retrieval—agents cannot answer correctly without navigating file relationships.
  3. Controllable Difficulty. Unlike benchmarks that quickly saturate, we can control difficulty because the questions are generated from SQL queries: increasing the complexity of the generating query produces more challenging versions of the benchmark for future models (see the sketch after this list).
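
As a concrete illustration of this dial, consider two generating queries over a hypothetical schema (the table and entity names below are invented for this sketch and are not the actual Context-Bench schema): each additional join adds a hop the agent must trace.

```python
# Hypothetical illustration only: the tables (people, pets, collaborations,
# addresses) are invented and are not the actual Context-Bench schema.

# One hop -> an easy question: "What pet does Mara Venn own?"
easy_query = """
SELECT pets.name
FROM pets
JOIN people ON pets.owner_id = people.id
WHERE people.name = 'Mara Venn';
"""

# Three hops -> a hard question: "In which city does the collaborator of
# the person who owns the pet 'Quibble' live?"
hard_query = """
SELECT addresses.city
FROM pets
JOIN people owners        ON pets.owner_id = owners.id
JOIN collaborations c     ON c.person_a_id = owners.id
JOIN people collaborators ON c.person_b_id = collaborators.id
JOIN addresses            ON addresses.person_id = collaborators.id
WHERE pets.name = 'Quibble';
"""
```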

Constructing Context-Bench

First, we programmatically generate a database of facts about various entities and the relationships between them (e.g., people, pets, addresses, medical records). We then use an LLM to generate challenging but tractable SQL queries against the database. Specifically, the question generator (sketched after the list below):

  1. Explores the database schema and considers the existing question corpus
  2. Ideates on potential questions by generating queries against the database
  3. Executes the queries to get the answers
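
The generator loop can be pictured with a minimal sketch, assuming a SQLite database and a placeholder llm() helper (the helper, prompt wording, and return format are illustrative assumptions, not the actual implementation):

```python
import sqlite3

def llm(prompt: str) -> str:
    """Placeholder for whatever LLM API the question generator calls."""
    ...

def generate_question(db_path: str, existing_questions: list[str]) -> dict:
    conn = sqlite3.connect(db_path)
    # 1. Explore the schema and consider the existing question corpus.
    schema = "\n".join(
        row[0] for row in conn.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table'"
        )
    )
    # 2. Ideate: ask the LLM for a candidate query that doesn't
    #    duplicate questions already in the corpus.
    sql = llm(
        f"Schema:\n{schema}\n\nExisting questions:\n{existing_questions}\n\n"
        "Write one challenging but tractable SQL query with a unique answer."
    )
    # 3. Execute the query to obtain the verified ground-truth answer.
    answer = conn.execute(sql).fetchall()
    return {"sql": sql, "answer": answer}
```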

Given the SQL queries and the execution results, we then use an LLM to convert them into natural-language questions and to convert the structured database into semi-structured text (the flattening step is sketched after the list below). Questions in Context-Bench require the model to search through multiple files and understand the relationships between many large sets of facts. For example, agents might need to:

  • Find information about a person, then locate a related project, then identify a collaborator
  • Search for a specific attribute, verify it across multiple files, then trace connections to related entities
  • Navigate hierarchical relationships to answer questions about indirect connections
  • Compare and contrast the attributes of many different sets of items
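
To show the shape of the semi-structured corpus, here is a minimal mechanical sketch of flattening rows into per-entity text files (the actual pipeline uses an LLM for this conversion, and the people table here is a hypothetical stand-in):

```python
import sqlite3
from pathlib import Path

def export_entity_files(db_path: str, out_dir: str) -> None:
    """Flatten each row of a hypothetical 'people' table into a text file
    that agents must search with grep/open rather than query with SQL."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for row in conn.execute("SELECT * FROM people"):
        lines = [f"Profile: {row['name']}"]
        lines += [f"{key}: {row[key]}" for key in row.keys() if key != "name"]
        (out / f"{row['name'].replace(' ', '_')}.txt").write_text("\n".join(lines))
```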

Evaluating Agents on Context-Bench

To evaluate their ability to perform general-purpose context engineering, agents are given two tools (minimal reference implementations follow the list):

  • open_files: Read the complete contents of a file
  • grep_files: Search for specific patterns within files
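
A sketch of plausible semantics for these tools follows; the exact signatures and return formats in Context-Bench may differ:

```python
import re
from pathlib import Path

def open_files(paths: list[str]) -> dict[str, str]:
    """Return the complete contents of each requested file."""
    return {path: Path(path).read_text() for path in paths}

def grep_files(pattern: str, root: str = ".") -> list[str]:
    """Return 'path:line_number: line' hits for a regex across the corpus."""
    hits = []
    for path in Path(root).rglob("*.txt"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if re.search(pattern, line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits
```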

Context-Bench measures the ability of agents to:

  • Construct effective search queries
  • Chain file operations to trace relationships (illustrated after this list)
  • Choose the right tool (grep vs. open)
  • Navigate hierarchical data efficiently
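
To make chaining concrete, here is a hedged two-hop example that reuses the grep_files/open_files sketches above (the entity names are invented): grep locates where an entity is mentioned, then open loads only that file, keeping the context window small.

```python
# Hop 1: find which file mentions the (invented) pet "Quibble".
hits = grep_files(r"\bQuibble\b", root="entities/")
# e.g. ["entities/Mara_Venn.txt:7: pet: Quibble"]

# Hop 2: open only the matching file instead of every profile.
owner_file = hits[0].split(":")[0]
profile = open_files([owner_file])[owner_file]
```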

We track the total cost to run each model on the benchmark. Cost reveals model efficiency: models with higher per-token prices may use significantly fewer tokens to accomplish the same task, making total cost a better metric than price alone for evaluating production viability.
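
As a toy illustration of why total cost matters more than unit price (the per-million-token prices and token counts below are placeholders, not real model rates or benchmark measurements):

```python
def run_cost(prompt_tokens: int, completion_tokens: int,
             price_in_per_m: float, price_out_per_m: float) -> float:
    """Total dollar cost given token usage and per-million-token prices."""
    return (prompt_tokens * price_in_per_m
            + completion_tokens * price_out_per_m) / 1_000_000

# A pricier model that retrieves precisely can still cost less overall:
cheap_but_verbose = run_cost(9_000_000, 600_000, 1.0, 4.0)   # $11.40
pricey_but_frugal = run_cost(2_000_000, 150_000, 3.0, 15.0)  # $8.25
```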

Models Trained for Context Engineering Excel

Claude Sonnet 4.5 leads the benchmark with a 74.0% score at $24.58, demonstrating exceptional ability to navigate complex information retrieval tasks. This model shows strong reasoning about what information to retrieve and how to chain lookups effectively. GPT-5 scores 72.67% at $43.56, showing competitive performance but at nearly twice the cost of Claude Sonnet 4.5. GPT-5-mini delivers solid performance at 64.33% and $12.45, making it an attractive option for cost-sensitive deployments.

Open-Weight Models Are Closing the Gap with Closed-Weight Models

One of the most encouraging findings is that open-weight models are rapidly catching up to proprietary models in context engineering capabilities:

  • GLM-4.6 from Zhipu AI achieves 56.83%, demonstrating that open-weight models can handle complex multi-step retrieval tasks.
  • Kimi K2 scores 55.13% at just $12.08, offering the best cost-per-point ratio among open-weight models and proving viable for production deployments.

Challenges Remain: Agentic Context Engineering as a New Frontier

While progress is encouraging, the benchmark also reveals areas where models struggle:

  • Nano models (GPT-4.1-nano at 16.2%, GPT-5-nano at 44.83%) show that smaller parameter counts still significantly impact complex reasoning
  • DeepSeek V3 (11.97%) and the GPT-OSS models (6.67%–20.2%) demonstrate that not all open-weight models have achieved breakthrough performance
  • Even top models miss 25-30% of questions, indicating substantial room for improvement

Key Takeaways

  1. Models trained for context engineering (e.g., Claude Sonnet 4.5) are the recommended drivers for production agents that require strong context engineering, offering the best balance of performance and cost.
  2. The gap between proprietary and open-weight models has narrowed significantly. GLM-4.6 and Kimi K2 demonstrate that open-weight models can now handle complex agentic tasks, making them viable for many production use cases.
  3. Context engineering remains challenging. Even the best models achieve only 74% accuracy, highlighting the complexity of multi-step information retrieval and the opportunity for continued improvement.

What's Next

We're actively updating Context-Bench as newer models are released and will be expanding the benchmark with more tasks, particularly around continual learning. We welcome community contributions to Context-Bench! Context-Bench is built on the open-source Letta Evals framework.
