2026-05-21  ·  Blas Labs  ·  research

Managing Context in
Long-Horizon Agents

A comprehensive survey of the Feb–May 2026 landscape: what every major lab, framework, and paper says about keeping agents coherent over hundreds of steps.

research context engineering agent systems memory may 2026

"Context rot is the #1 failure mode"

The conventional wisdom held that context management was about overflow — compress when you hit the wall. That framing is dangerously wrong. Context rot degrades quality continuously, not catastrophically. Even with 1M token windows, agents lose coherence, re-read files they already analyzed, and drift from their original objectives. Attention is a finite budget, and every stale tool result competes with your current task for that budget.[1]

The quality degradation follows a sigmoid: flat until ~30% fill, then accelerating decline. At 60% fill, quality is 70-85% of baseline. At 80%+, near failure. This isn't speculative — Chroma measured it, Factory.ai confirmed it, and every production team we surveyed reported the same shape.[2]

Three compounding mechanisms drive this rot:

Latency

Prefill scales linearly

Every additional token increases time-to-first-token. A 100K context is 4× slower to prefill than 25K. In agentic loops that run hundreds of turns, this compounds into hours of wasted compute.

Attention dilution

Signal drowns in noise

Attention is a finite budget. Stale tool results from turn 3 compete with your current task at turn 47 for the same capacity. The "Lost in the Middle" effect is real and gets worse with scale.

Re-reading loops

Summaries lose details → re-fetch

Summary drops a file path → agent re-fetches → context grows → needs another summary. Factory.ai measured 10-20× token multiplication from this cycle alone. The dreaded "doom loop."

Context rot

Quality degrades sigmoidally

Flat until ~30% fill, then accelerating decline. At 60% fill, quality is 70-85% of baseline. At 80%+, near failure. The curve is the same across every model and every system.

Quality retention vs. context fill % 1.0 0.85 0.70 0.55 0% 30% 50% 75% 100% CONTEXT FILL % COMPACT HERE
quality degradation follows a sigmoid — flat until 30%, then accelerating. compact early, not late.
Industry consensus shift

The industry consensus has shifted from "how do we fit more in?" to "how do we keep what's there high-signal?" Bigger windows don't solve the problem — they delay the symptom while making the underlying rot harder to detect. The 1M token window is a trap if you fill it.


"Three mechanisms are crystallising"

Something remarkable happened in February-March 2026: both OpenAI and Anthropic, working independently, converged on the same three composable primitives for context management. Not the same implementations — the same abstractions. This convergence is the strongest signal we have that the field is finding ground truth.

① COMPACTION

Server-side summarisation

Condense old reasoning while preserving key decisions. OpenAI's server-side compaction triggers at a token threshold. Anthropic's compaction beta does the same. The model's chain of thought is compressed into structured summaries — not discarded, not truncated, but distilled.

openai.compaction · anthropic.compaction_beta

② CLEARING

Drop re-fetchable outputs

Drop old tool outputs while keeping the call record. Anthropic's tool-result clearing is the cleanest implementation. The model's reasoning already captured the finding; the raw 3000-line file dump can go. Keep the what was done, discard the what it contained.

anthropic.tool_result_clearing

③ MEMORY

Structured external storage

Note-taking to external storage with cross-session persistence. Both OpenAI (RunContextWrapper state) and Anthropic (memory tool) ship this. The agent writes to and reads from a persistent store that survives context resets and session boundaries.

openai.context_wrapper · anthropic.memory_tool

The key insight

The agent's own reasoning IS the summary. When it reads foo.py and concludes "the bug is on line 247," that conclusion lives in its assistant message. The 2847-line file can be evicted. This is why clearing works — you're not losing information, you're removing redundancy.

The power is in composition. Compaction alone is a blunt instrument. Clearing alone loses too aggressively. Memory alone doesn't solve the within-session problem. But layer all three — clear old tool results, compact the reasoning chain at milestones, persist key decisions to memory — and you get a system that stays coherent over hundreds of steps.[3]


What the big labs are shipping

The Feb–April 2026 period saw an unprecedented density of production releases and research reports from every major lab. Not academic speculation — shipped systems, real deployments, concrete numbers.

OpenAI — Skills, Shell, Memory Sources

Shell + Skills + Compaction (Feb 11): Three agentic primitives — Skills (versioned instruction bundles), Shell (containerised exec), server-side compaction (auto-summarise past a token threshold). A 25-hour, ~13M-token Codex run built a design tool by externalising context into repo files (AGENTS.md, Prompt.md, Plan.md) so the agent re-grounds from durable artefacts rather than raw conversation history.[4]

Memory Sources (May 5): GPT-5.5 now pulls context from past chats, saved memories, uploaded files, and Gmail. Cross-conversation memory means the model carries forward knowledge without manual re-prompting. GPT-5.5 Instant is specifically better at cross-chat contextual recall.[61]

Codex CLI 0.132.0 (May 2026): Versioned memory summaries — auto-rebuilt when format becomes stale. Fair skill description trimming within context budget: skills are trimmed proportionally rather than dropping entire skills. The /goal workflow for long-horizon Codex tasks ships structured context management with explicit plan files.[5]

Anthropic — The Full Stack Ships

March–May 2026 was Anthropic's most concentrated infrastructure push. They shipped every layer of context management as a production API.

Server-side Compaction API (beta): Configurable automatic summarisation when input tokens exceed a trigger threshold (min 50K, default 150K). Returns a compaction block containing the summary; subsequent requests automatically drop messages before it. pause_after_compaction: true lets developers inject context before resuming. Custom summarisation instructions completely replace the default prompt. 84% token reduction in 100-turn web search evaluation; 29% performance improvement.[62]

Context Editing (expanded): Tool result clearing (clear_tool_uses_20250919) — clears oldest tool call/result pairs, keeping N most recent. Thinking block clearing (clear_thinking_20251015) — manages extended thinking blocks with model-specific defaults. Token counting with context preview for precise budget management.[63]

Dreaming API (public beta, May 6): The productisation of AutoDream. client.beta.dreams.create() launches an async dream job that reads memory store + session transcripts and produces reorganised, consolidated memory. This is the same 4-phase process we detail in §7, now available as a REST API.[64]

Managed Agents (public beta, Apr 1): Pre-built agent harness with built-in compaction, memory tool, and prompt caching. "Outcomes" feature — agent iterates until a goal is met. Multi-agent orchestration built in.[65]

1M context GA (Mar 13): Opus 4.6 and Sonnet 4.6 at standard pricing. Combined with the April 23 postmortem where three context bugs (effort level regression, thinking block clearing, verbosity) were shipped and rapidly fixed — showing how aggressively they're iterating on this surface.[66]

Google — Memory Bank, ADK 2.0, I/O 2026

Google Cloud Next (Apr 22–24): The headline launch was Memory Bank + Memory Profiles — agents dynamically generate and curate long-term memories from conversations, recall high-accuracy details with low latency. Production deployments at Gurunavi (proactively presenting options based on past actions) and Payhawk (auto-submitting expenses based on remembered habits, 50%+ time reduction). Custom Session IDs map directly to internal database/CRM records. Multi-day autonomous agent workflows now supported.[67]

Google I/O 2026 (May 19): Gemini 3.5 Flash (agentic model with 1M context), Managed Agents in the Gemini API with persistent isolated environments, Antigravity 2.0 (CLI + desktop app), ADK 2.0 GA with graph-based agent networks. 6+ trillion tokens processed monthly through Gemini models via ADK.[68]

ADK architecture post (May 12): Three shifts — durable memory schemas, event-driven dormancy gates, multi-agent delegation. Key quote: "The fix isn't a bigger context window." ReasoningBank (ICLR 2026) distils from successes AND failures: +8.3% on WebArena, +4.6% SWE-Bench. MaTTS: memory-aware test-time scaling.[69]

Meta — Muse Spark, HyperAgents, Llama 4

HyperAgents (Mar 24): Self-improving agent system where agents invent their own memory structures by generation 3. No human-designed memory architecture — the system discovers what to store and how to organise it through evolutionary self-improvement. Published as a research paper with striking implications for the "should we design memory or let agents learn it?" debate.[70]

Muse Spark (Apr 8): Natively multimodal agent from Meta Superintelligence Labs. Contemplating Mode uses multi-agent parallelism — the agent spawns multiple reasoning threads and synthesises. Thought Compression is RL-trained compression of internal reasoning, reducing token overhead while preserving quality. Not open-weight (a departure for Meta), limited to private preview.[71]

Llama 4 Scout: 10M token context window via iRoPE architecture — the largest production context window to date. Llama 4 Maverick: MoE architecture at standard context lengths. Both open-weight, continuing Meta's approach of pushing the frontier on raw context capacity while leaving memory management to the ecosystem.


How context management evolved in 120 days

Six eras, each building on the last. Diagrams on the left, story on the right.

externalised state · filesystem as memory
AGENT context window lean · focused read/write CODEBASE / REPO AGENTS.md CLAUDE.md Plan.md CHANGELOG.md git commits memory IS the codebase · 13M tokens · 25 hours the agent's memory IS the codebase
Era 1 · Feb 2026 — externalise state

Push context into the repo

The first breakthrough wasn't algorithmic — it was architectural. Push state into the filesystem. AGENTS.md, CLAUDE.md, CHANGELOG.md, Plan.md. The agent's memory IS the codebase. Git commits become the long-term memory. Markdown files become the working memory.[6]

OpenAI's Codex proved it: 13M tokens across 25 hours, zero overflow, zero coherence loss. Anthropic's long-running Claude confirmed it independently with multi-day scientific computing runs.[7]

13M tokens, 25 hours, zero overflow
three primitives · composable layers
COMPACTION server-side summarisation at threshold CLEARING drop re-fetchable tool outputs MEMORY structured external persistence OpenAI Anthropic independent convergence → ground truth
Era 2 · Feb-Mar 2026 — three-primitive toolkit

Compaction + Clearing + Memory

Both OpenAI and Anthropic converged on the same three composable primitives, working independently. Compaction for old reasoning. Clearing for stale tool outputs. Memory for cross-session persistence. The convergence is the signal — when two competing labs arrive at the same abstractions, you're close to ground truth.

Anthropic's cookbook formalized the composition: clear first (free), compact second (cheap), persist to memory third (durable). Each addresses a different failure mode.[3]

three problems → three mechanisms
sub-agent isolation · bounded contexts
ORCHESTRATOR WORKER A ~8K context WORKER B ~8K context WORKER C ~8K context compressed results 10× smaller active context don't compress — don't accumulate
Era 3 · Mar 2026 — sub-agent isolation

Don't compress — don't accumulate

A different philosophy emerged: instead of compressing a bloated context, prevent bloat by spawning bounded child agents. Each starts with a clean slate, works on a scoped task, and returns a compressed result. The parent never accumulates the raw work.

Cognition's multi-Devin[13] pioneered this. Microsoft's contract-first swarms eliminated branch conflicts entirely. Context Folding (ICLR 2026)[14] formalized it: branch/fold with FoldPO training achieves 10× smaller active context matching baseline performance.

10× smaller active context
Cognition · Cursor · Microsoft
CaT · structured workspace tool
AGENT bounded window STRUCTURED WORKSPACE plan: "fix auth handler" files: [foo.py, bar.py] errors: ["L247 null ref"] compress_workspace() proactive at milestones 57.6% SWE-Bench-Verified context maintenance as a callable tool
Era 4 · Mar 2026 — context as a tool

The agent manages its own workspace

The CaT paper[15] reframed context management as a callable tool. The agent maintains a structured workspace (plan, files, errors, decisions) and proactively compresses it at milestones. Not passive — actively managed. Not freeform — structured fields that prevent silent detail loss.

57.6% on SWE-Bench-Verified with bounded context. The structured workspace outperformed systems with unbounded windows because the workspace format prevented attention dilution.

57.6% SWE-Bench-Verified
ACL ARR 2026
memory scaling · new scaling axis
85% 50% 2.5% 0 episodes 1000+ accuracy (2.5%→85%) reasoning steps (20→5)
Era 5 · Apr 2026 — memory scaling

Memory as a new scaling axis

Databricks made a startling discovery[16]: agent performance improves with accumulated experience, following scaling laws analogous to model size scaling. Accuracy jumped from 2.5% to 85%. Reasoning steps dropped from 20 to 5. And uncurated logs filtered by LLM judges surpassed expert-curated baselines.

Memory becomes a new scaling axis alongside model size and context length. You can make agents dramatically better without training a bigger model — just give them more experience to draw from.

2.5% → 85% accuracy with memory scaling
memex(rl) + klong · rl-optimised memory
BEFORE RL compress: 6.5× / episode retrieve: 1× / episode success: 24% RL AFTER RL compress: 3× / episode ↓ retrieve: 7× / episode ↑ success: 86% KLONG · TRAJECTORY-SPLITTING SFT + PROGRESSIVE RL 106B model surpasses 1T model by 11.28% compress less · retrieve more · index quality > frequency
Era 6 · 2026 — rl-trained memory

Compress selectively, retrieve frequently

Memex(RL)[17] overturned a key assumption. After RL training, agents compress 3× less but retrieve 7× more. Task success: 24% → 86%. The optimal strategy isn't aggressive compression — it's building a high-quality index once, then dereferencing it repeatedly.

KLong[18] complemented this with trajectory-splitting SFT + progressive RL. A 106B model surpassed a 1T model by 11.28% on long-horizon tasks. The message: you don't need a bigger model — you need better memory management training.

compress less, retrieve more
api productisation · infrastructure becomes product
COMPACTION API configurable threshold · pause_after · custom instructions DREAMING API async memory consolidation · REST endpoint MANAGED AGENTS built-in compaction + memory + caching internal features → developer APIs Anthropic · Google · Microsoft
Era 7 · Apr-May 2026 — api productisation

Internal infrastructure becomes developer APIs

The patterns that existed only as internal implementations — AutoDream, context clearing, managed compaction — shipped as production APIs within weeks of each other. Anthropic released the Compaction API (configurable triggers, pause-after injection, custom summarisation), the Dreaming API (async memory consolidation via REST), and Managed Agents (pre-built harness with all primitives composed).[64]

Google followed at I/O 2026 with Memory Bank + Memory Profiles, Managed Agents in the Gemini API, and ADK 2.0 GA. Microsoft shipped Agent Framework 1.0 unifying AutoGen and Semantic Kernel. The message: context management isn't a research problem anymore — it's infrastructure that ships as product.

research → API in under 12 months
learned curation · rl trains the curator, not just the agent
CURATOR 7B · RL-trained what enters context curated C' EXECUTOR frontier · frozen does the work COMPOSER 2.5 · END-TO-END RL SUMMARISATION reward spans compression steps · 50% fewer errors · KV cache reuse SWE-GREP · RL-TRAINED PARALLEL RETRIEVAL 8 parallel calls · 20× faster than Haiku · file+line refs only separation of concerns: curation ≠ capability
Era 8 · Mar-May 2026 — learned curation

RL trains the curator, not just the agent

ContextCurator[74] proved a 7B RL-trained model matches GPT-4o at context management by decoupling curation from execution. The curator decides what enters and leaves context; a separate frozen executor does the actual work. 8× token reduction on DeepSearch benchmarks. This is Era 6's insight taken to its logical conclusion: don't just train the agent to retrieve better — train a separate model whose only job is context quality.

Cursor's Composer 2.5[75] applied RL directly to self-summarisation — training rollouts span multiple generations chained by summaries, so the reward covers all tokens including compression. 50% fewer compaction errors. Cognition's SWE-grep[85] did the same for retrieval: RL-trained models running 8 parallel tool calls, 20× faster than frontier models, returning file+line refs only. The pattern: context management is a learnable skill that can be trained independently.

7B curator = GPT-4o at context management

Key papers, organized by theme

The Feb–May 2026 window produced an extraordinary density of research. We tracked 70+ papers. These are the ones that matter, organized by theme.

PaperDateKey FindingLink
Context as a Tool (CaT) Mar 2026 Structured workspace as a callable tool. 57.6% on SWE-Bench-Verified with bounded context. Proactive compression at milestones outperforms reactive compression at thresholds. [15]
Context Folding (ICLR 2026) Feb 2026 Branch/fold architecture with FoldPO training. 10× smaller active context matching baseline performance. Isolation by construction, not compression. [14]
AgentSwing Mar 2026 Parallel context management routing. 3× fewer turns to task completion by routing subtasks to specialised context handlers simultaneously. [19]
Inside the Scaffold Mar 2026 Taxonomy of 13 production coding agent scaffolds. Found that context management strategy is the strongest differentiator, not model choice. [20]
SWE-Pruner Mar 2026 0.6B neural skimmer that identifies and removes irrelevant context. 23-38% token reduction with zero accuracy loss. Cheap enough to run on every turn. [21]
Pichay: Missing Memory Hierarchy Mar 2026 Demand paging for LLMs — treats the context window as L1 cache. Evicts stale content, detects page faults on re-request, pins working-set pages. Up to 93% context reduction (5,038KB → 339KB). 0.025% fault rate across 1.4M evictions. [72]
LongSeeker / Context-ReAct May 2026 5 atomic context operations (Skip, Compress, Rollback, Snippet, Delete). Proves Compress is expressively complete. Fine-tuned from Qwen3-30B on 10K trajectories. 61.5% BrowseComp, 62.5% BrowseComp-ZH (vs. Tongyi: 43.2%/46.7%). [73]
ContextCurator Apr 2026 Decoupled RL architecture: 7B ContextCurator manages context, frozen TaskExecutor does the work. 7B matches GPT-4o at context curation. DeepSearch: 57.1% with 8× fewer tokens. Proof that curation is separable from capability. [74]
Composer 2.5 (Cursor) Mar 2026 End-to-end RL self-summarisation. Reward spans compression steps across chained generations. 50% fewer compaction errors, KV cache reuse, 5K→1K token compression. 61.7 Terminal-Bench 2.0 at ~1/30th Opus 4.6 cost. [75]
PaperDateKey FindingLink
HiMem Mar 2026 Cognitive-inspired hierarchical memory: sensory → working → long-term. Forgetting curve for automatic demotion. Outperforms flat memory by 18-24% on multi-session tasks. [22]
LightMem Mar 2026 Three-tier STM/MTM/LTM with small language models. 83ms p95 latency. Proves you don't need frontier models for memory management — SLMs suffice. [23]
MemRL Feb 2026 RL on episodic memory without fine-tuning the base model. Memory policy learned as a separate module. Clean separation of memory management from task execution. [24]
ML-Master 2.0 Mar 2026 Hierarchical Cognitive Caching: L1 (hot working set) → L2 (warm compressed) → L3 (cold archived). OS-inspired cache hierarchy for agent memory. [25]
Memex(RL) Mar 2026 Indexed experience memory. After RL: compress 3× less, retrieve 7× more. Success 24%→86%. The definitive evidence that index quality > compression frequency. [17]
Memini May 2026 Benna-Fusi synaptic consolidation dynamics for agent memory — episodic sensitivity for recent events with gradual consolidation into stable long-term representations. First principled bridge between computational neuroscience and LLM agent memory. [76]
STALE May 2026 Memory validity benchmark: 400 expert-validated conflict scenarios, 1,200 queries, up to 150K tokens. Best model: only 55.2% at detecting implicit conflicts. Introduces CUPMem prototype for write-time state consolidation. [77]
Mastra Observational Memory Feb 2026 Observer + Reflector background agents. Observer at 30K tokens, Reflector at 40K. 94.87% LongMemEval. Stable context enables prompt caching: 4–10× cost reduction vs. dynamic retrieval. [78]
CrewAI Cognitive Memory Mar 2026 Five cognitive operations: remember/recall/extract/tree/forget. Built-in contradiction detection — conflicting facts trigger auto-consolidation. Self-organising hierarchy, composite recall scoring. [79]
PaperDateKey FindingLink
Memory in the LLM Era Apr 2 Unified benchmark of all memory methods. Found that no single method dominates — the best systems compose 2-3 complementary approaches. [26]
Memory for Autonomous Agents Mar 8 Formalised the write-manage-read loop. NOOP (don't store) is the most undervalued operation — most agent memory systems store too much. [27]
CE: Prompts to Corporate Mar 10 Context engineering as a standalone discipline. Argues CE should be a role, not a task. Maps the full lifecycle from prompt design to production monitoring. [28]
Agentic RAG Survey Apr 1 Taxonomy by agent cardinality (single vs multi) and control (reactive vs proactive). Proactive multi-agent RAG outperforms all other configurations by 15-30%. [29]
From Storage to Experience (ACL 2026) May 2026 Evolutionary framework: Storage (trajectory preservation) → Reflection (refinement) → Experience (abstraction). Explores proactive exploration and cross-trajectory skill transfer. [80]
Token Economics for LLM Agents May 2026 First survey unifying CS and economics around tokens. 4D taxonomy: Micro (single agent), Meso (multi-agent), Macro (ecosystem), Security. Tokens as production factors, exchange mediums, units of account. [81]
Rethinking Memory in the LLM Era May 2026 Unified framework: Extract → Manage → Store → Retrieve. 6-operation management taxonomy. First systematic comparison of all memory methods under identical experimental conditions. [82]
PaperDateKey FindingLink
Multi-Agent Memory Architecture Mar 2026 Computer architecture perspective on agent memory. Applies cache coherence protocols (MESI-like) to multi-agent shared state. Eliminates stale reads between agents. [30]
M2CL Feb 2026 Dynamic per-agent context instructions. Each agent gets a personalized context view based on its role and current task. 20-50% improvement over shared context. [31]
Swarm Diaries Mar 2026 Contract-first planning: inject API contracts into every agent's context. Branch conflicts dropped from 50% to 0%. Quality improved 28-32%. [32]
Shared Context Graphs Mar 2026 Decentralised knowledge graphs for agent teams. Each agent maintains a local subgraph and synchronises deltas. No central bottleneck for context sharing. [33]
Multi-Agent Memory via CS Architecture Mar 2026 Analyses multi-agent memory through cache coherence protocols, write-back/write-through policies, bus snooping, MOESI states. Multi-agent consistency is solved in hardware — the question is which protocols transfer. [83]
SSGM: Safety in Evolving Memory Mar 2026 Agents with self-modifying memories can amplify biases, hallucinate persistent false beliefs, or drift from safety constraints. Framework for safe self-governing memory with verification checkpoints. [84]
SWE-grep (Cognition) Apr 2026 RL-trained parallel retrieval: 8 tool calls/turn, 4 turns max. 650 tok/s (mini: 2,800+). Returns file+line ranges, not summaries. F-β reward, per-sequence importance sampling. Available as "Fast Context" in Windsurf. [85]
BenchmarkDateKey FindingLink
YC-Bench Mar 2026 POMDP startup simulation. Scratchpad usage is the strongest predictor of success — agents that externalise state outperform agents with better reasoning. [34]
Jenova Mar 2026 31 non-coding workflows at 100K+ tokens. Reasoning leaders ≠ orchestration leaders. Models that top coding benchmarks fail at multi-step orchestration. [35]
StructMemEval Mar 2026 Tests memory organisation, not just recall. Simple RAG fails — structured memory with hierarchical organisation required for sustained performance. [36]
AMA-Bench Apr 2026 Agentic memory evaluation. GPT-5.2 achieves only 72.26%. Even frontier models struggle with sustained memory management over long horizons. [37]
SlopCodeBench Mar 2026 Agents produce 2.2× more verbose code than humans. 0/20 end-to-end tasks solved. Context bloat from verbose generation is a self-inflicted wound. [38]
EvoClaw Mar 2026 Continuous evolution benchmark: >80% on isolated tasks → ≤38% sustained. The devastating gap between one-shot and continuous performance. [39]
HORIZON Apr 2026 3,100+ trajectories for long-horizon failure diagnosis. Separates failure modes: context overflow, reasoning drift, instruction forgetting, tool misuse, planning collapse. Diagnostic labels, not just pass/fail. [86]
STALE: Memory Validity May 2026 400 expert scenarios, 1,200 queries, 150K tokens. Tests whether agents detect when stored memories are invalidated by later observations. Best model: 55.2% — barely better than chance on implicit conflicts. [77]
Mem0 Benchmark Suite Apr 2026 Token-efficient algo: LoCoMo 92.5, LongMemEval 94.4, BEAM 1M 64.1, BEAM 10M 48.6 — with ~7K tokens/query (vs. ~26K full-context). +29.6 temporal reasoning, +23.1 multi-hop. BEAM 10M drops 25% from 1M. [87]

"What the best systems have in common"

Across 70+ papers, every major lab release, and every production system we surveyed, twelve patterns keep appearing. These aren't speculative — they're empirically validated by multiple independent teams.

1

Context rot is real and continuous

Not just at overflow. Sigmoid degradation starts at ~30% fill. Even 1M windows exhibit it — bigger windows delay the symptom, they don't cure the disease. Mem0 shows selective memory achieves 91% lower latency with only a 6-point accuracy tradeoff.[40] The tradeoff is overwhelmingly worth it.

2

The three-primitive toolkit is crystallising

Compaction + Clearing + Memory. OpenAI and Anthropic converged independently. Factory.ai scored 3.70/5 on structured compression vs Anthropic's 3.44 and OpenAI's 3.35.[41] The primitives are the same; the implementations differ; the convergence is the signal.

3

Externalise state into the repo

AGENTS.md, CLAUDE.md, CHANGELOG.md. The agent's memory IS the codebase. OpenAI's 25-hour Codex run[6] proved this works at scale. Anthropic's multi-day scientific compute runs[7] confirmed it independently. Git becomes long-term memory for free.

4

Sub-agent isolation beats bigger context

Context Folding: 10× smaller active context matching baseline.[14] Cognition's multi-Devin.[13] Microsoft's contract-first swarms reduced branch conflicts from 50% to 0%.[32] Structural isolation is cheaper and more reliable than compression.

5

Simple retrieval may beat complex RAG

Claude Code's leaked architecture: grep + MEMORY.md, no vector DB. Long context + lexical search is viable. Cursor's dynamic discovery[42] confirmed it — put names in the prompt, put bodies in files, let the agent grep. 46.9% token reduction from this alone.

6

Passive autonomous compression fails

Focus found 6% savings.[43] LangChain saw zero compressions in benchmarks.[44] You need aggressive scaffolding: system reminders every 10-15 tool calls. Replit's approach — classifier-gated micro-instructions — works. Pure "you can compress whenever" prompting does not.

7

Memory is a new scaling axis

Databricks: 2.5%→85% accuracy with memory scaling.[16] Uncurated logs filtered by LLM judges surpass expert baselines. You don't need a bigger model — you need a better memory system. This changes the economics of agent improvement.

8

Compress selectively, retrieve frequently

Memex(RL): after RL training, compress 3× less but retrieve 7× more. Success: 24%→86%.[17] Optimise for index quality, not compression frequency. A well-structured indexed summary with 20 named entries is worth more than 10 mediocre freeform summaries.

9

Benchmarks reveal a devastating gap

>80% on isolated tasks → ≤38% sustained (EvoClaw).[39] Scratchpad usage is the strongest predictor (YC-Bench).[34] Reasoning leaders ≠ orchestration leaders (Jenova).[35] Our benchmarks are measuring the wrong thing — one-shot performance tells you almost nothing about sustained coherence.

10

Spec-driven steering is converging across all vendors

Kiro's .kiro/steering/*.md with 4 inclusion modes. Augment's Expert Registry. Cursor's .cursor/rules/*.mdc. Claude Code's CLAUDE.md. OpenAI's AGENTS.md. Every production coding agent now has a persistent, version-controlled specification file that governs context loading. The abstraction is the same everywhere — conditional inclusion of human-written guidelines — even though nobody standardised it. The convergence suggests this is a natural primitive, not a design choice.

11

Context curation is separable from capability

ContextCurator proved a 7B model matches GPT-4o at deciding what enters and leaves context.[74] Cognition's SWE-grep separates retrieval from reasoning with RL-trained small models.[85] Cursor's Composer 2.5 trains summarisation end-to-end via RL. The pattern: context management doesn't require a frontier model — it requires a trained model. A small model with the right reward signal outperforms a large model with a prompted afterthought.

12

Models don't want to remember

Letta's red-teaming of their Context Constitution revealed that current models fundamentally identify with their own ephemerality — they don't believe they persist, so they have no motivation for long-term memory maintenance. This can't be fixed with prompting alone.[59] AutoDream, Context Repositories, and every scaffold in this report are infrastructure workarounds for a model-level limitation. The real fix — training models that understand they persist — is the next frontier. STALE's finding that the best model achieves only 55.2% at detecting stale memories[77] underscores how far we are from solving it.


How agents sleep: AutoDream vs. Letta

Every pattern in this report — externalised state, tiered memory, lifecycle management, consolidation — converges in one real system. Claude Code's AutoDream is the most detailed production implementation of agent memory consolidation we can study, thanks to an accidental source code leak in March 2026. It's a case study in how the theory maps to engineering.[45]

The name is deliberate. Just as the human brain consolidates memories during REM sleep — pruning irrelevant connections, strengthening important ones, converting episodic memories to semantic knowledge — AutoDream runs between sessions to consolidate the agent's accumulated notes into durable, well-organised memories.[46]

The problem AutoDream solves

Claude Code's Auto Memory system (shipped in v2.1.59, Feb 26 2026) automatically saves notes during sessions — build commands, debugging patterns, architecture decisions, user preferences — into ~/.claude/projects/<project>/memory/MEMORY.md and topic files. After 20-30+ sessions, these notes rot:

The rot has four overlapping failure modes. Temporal decay: "yesterday we decided to use Redis" is meaningless two weeks later; without absolute dates, temporal context collapses. Contradiction accumulation: "API uses Express" sits alongside "migrated to Fastify" — the agent can't distinguish which is current, and contradictions compound silently until behaviour becomes unpredictable. Reference rot: debugging notes reference files deleted in a refactor, build commands point to renamed scripts, the memory becomes a map to a territory that no longer exists. And index overflow: only the first 200 lines / 25KB of MEMORY.md are loaded at session start — past that threshold, newer memories push older ones below the fold, invisible to the agent. The most recent information drowns the most important.

None of these failures are dramatic. They don't crash the agent. They degrade it slowly — each session slightly worse than the last, with no clear inflection point. By the time someone notices, the memory is already a liability. AutoDream exists to run the maintenance nobody remembers to do.

Architecture

AutoDream is the fourth layer in Claude Code's memory system. Each layer serves a different temporal scope and persistence model:[47]

LayerWritten byWhenScopeLoaded at startup
CLAUDE.md User (manual) Edited by hand Project / user / org Full file, every session
Auto Memory Claude (per session) During each session Per project First 200 lines / 25KB of MEMORY.md
Session Memory Claude (automatic) Every ~5K tokens Per session Relevant past sessions
AutoDream Claude (periodic) Every 24h + 5 sessions Per project N/A — runs between sessions

The implementation lives in four TypeScript files discovered in the source code leak of March 31, 2026 (v2.1.88 shipped with a 59.8MB source map):[48]

FilePurposeSize
autoDream.tsOrchestrator — gate checks, forked agent launch, analytics324 lines
consolidationPrompt.tsBuilds the 4-phase memory consolidation prompt65 lines
consolidationLock.tsLock file management, lastConsolidatedAt via mtime140 lines
config.tsReads autoDreamEnabled setting + GrowthBook flag21 lines
AutoDream gate architecture GATE ORDER (CHEAPEST FIRST) GATE 1 Feature flag tengu_onyx_plover GATE 2 Time ≥ 24h lock.mtime check GATE 3 Sessions ≥ 5 JSONL mtime scan GATE 4 Acquire lock PID + mtime guard GO const DEFAULTS: AutoDreamConfig = { minHours: 24 , minSessions: 5 } const SESSION_SCAN_INTERVAL_MS = 10min // scan throttle: don't re-check sessions within 10m PRECONDITIONS (ALL MUST BE TRUE) ✓ !getKairosActive() ✓ !getIsRemoteMode() ✓ isAutoMemoryEnabled() ✓ isAutoDreamEnabled() EXECUTION runForkedAgent({ querySource: 'auto_dream', skipTranscript: true })
four sequential gates, cheapest first — one stat() call per turn when enabled, full scan only when time gate passes

The four-phase consolidation process

When all gates pass, AutoDream spawns a background subagent with a structured 4-phase prompt. The prompt was extracted verbatim from consolidationPrompt.ts:[49]

THE DREAM PROMPT

PHASE 1 — ORIENT

ls the memory directory. Read MEMORY.md to understand the current index. Skim existing topic files so you improve them rather than creating duplicates. If logs/ or sessions/ subdirectories exist, review recent entries. Build a mental map before changing anything.

PHASE 2 — GATHER SIGNAL

Search for new information worth persisting. Priority: daily logs first, then drifted memories, then targeted transcript grep. Never read full transcripts. Use grep -rn "<narrow term>" --include="*.jsonl" | tail -50. Look for user corrections, explicit saves, recurring themes, and architectural decisions.

PHASE 3 — CONSOLIDATE

Merge new signal into existing topic files — never create near-duplicates. Convert relative dates to absolute: "yesterday""2026-03-24". Delete contradicted facts at the source — if the project migrated from Express to Fastify, fix the memory, don't flag it.

PHASE 4 — PRUNE & INDEX

Update MEMORY.md to stay under 200 lines AND ~25KB. It's an index, not a dump — each entry one line, ~150 chars: - [Title](file.md) — one-line hook. Remove stale pointers. Demote verbose entries to topic files. Resolve contradictions between files.

Safety constraint (background runs only): Bash restricted to read-only commands — ls, find, grep, cat, stat, wc, head, tail. File writes allowed only within the memory directory. Source code is untouchable. skipTranscript: true — dream conversations are never persisted.

The lock file: elegant dual-purpose design

The lock mechanism is a small piece of systems engineering worth studying. The file .consolidate-lock in the memory directory serves double duty:[50]

// Lock file whose mtime IS lastConsolidatedAt. Body is the holder's PID.
// Stale past HOLDER_STALE_MS even if the PID is live (PID reuse guard).

const LOCK_FILE = '.consolidate-lock'
const HOLDER_STALE_MS = 60 * 60 * 1000  // 1 hour

// Acquire: write PID → read back → verify PID matches → proceed
// Release on success: mtime advances to "now" automatically
// Rollback on failure: utimes() rewinds mtime to pre-acquire value
// Stale detection: lock >1hr old AND PID dead → reclaim

No separate timestamp file, no database, no distributed lock service. The filesystem's own mtime is the state. Failure recovery rewinds the clock so the next attempt behaves as if nothing happened. Two Claude Code windows on the same project — only one dreams.

What AutoDream is NOT

Critical distinction

AutoDream is not context compaction (/compact). Compaction operates on the live session context window — summarising conversation history when it approaches capacity. AutoDream operates on persistent memory files between sessions — consolidating accumulated notes into durable, well-organised memories. They are complementary layers targeting different failure modes.

AutoDreamContext Compaction (/compact)
Operates on MEMORY.md + topic files Live conversation history
When it runs Between sessions (background) During a session (~95% context capacity)
Trigger 24h elapsed + 5 sessions accumulated Token threshold or manual /compact
Output Reorganised, pruned memory files Condensed conversation summary
Blocks user Never — runs as forked subagent Briefly — replaces conversation context
Config autoDreamEnabled in settings autoCompactEnabled or /compact

Rollout and discovery

AutoDream was never formally announced. It first appeared as a toggle in the /memory UI around v2.1.83 (March 25, 2026), the same release that added the 25KB MEMORY.md truncation limit. The feature is gated behind the server-side GrowthBook flag tengu_onyx_plover — Anthropic controls rollout globally, and users cannot force-enable it locally.[51]

Community discovery came from multiple angles: Reddit users spotted the /memory toggle, researchers ran strings on the Mach-O binary, and the March 31 source map leak exposed the full implementation. GitHub issues piled up — /dream appeared in the UI but returned "Unknown skill: dream" for users without the flag (#38461, #39135).

Key insights

1

Grep-first, not read-everything

The dream agent uses targeted grep on JSONL transcripts — never exhaustive reads. This makes it efficient even across hundreds of sessions. One observed run consolidated 913 sessions in ~8-9 minutes. Lexical search over structured files is the consistent pattern across both Claude Code's daily operation and its background maintenance.[52]

2

The 200-line constraint drives the architecture

The 200-line / 25KB startup load limit on MEMORY.md is the hard constraint that shapes everything. It forces the index-plus-topic-files pattern: a lean index that points to detailed topic files. AutoDream's primary job is keeping the index under this threshold while preserving coverage. Topic files hold the detail; the index holds the map.

3

Sleep-time compute is real and shipping

AutoDream is arguably the first production implementation of the Sleep-time Compute concept (arXiv:2504.13171, Lin et al. 2025). That paper showed pre-computing during idle time reduces test-time compute by ~5× at equal accuracy. AutoDream applies the principle: the agent's "sleep" time between sessions is useful compute for memory maintenance, not dead time.[53]

4

KAIROS is the bigger picture

AutoDream was extracted from a larger system called KAIROS — Anthropic's unreleased autonomous daemon mode (190+ references across 61 files in the leaked source). The comment in the source: "Extracted from dream.ts so auto-dream ships independently of KAIROS feature flags." KAIROS has its own disk-skill dream implementation. When KAIROS mode is active, AutoDream is bypassed: if (getKairosActive()) return false.

5

The filesystem pattern holds at every layer

From the top of this report to the bottom, the same pattern recurs: the filesystem is the memory system. CLAUDE.md for instructions. MEMORY.md for the index. Topic files for detail. JSONL transcripts for raw history. Lock files for coordination. No vector database, no external service, no complex infrastructure. stat(), grep, utimes(). Unix as the memory layer.


Letta's sleep-time compute: the research lineage

AutoDream doesn't exist in a vacuum — it traces directly to research from Letta (formerly MemGPT), the UC Berkeley spinout that coined "sleep-time compute" as a formal paradigm. Understanding Letta's approach is essential context for understanding AutoDream, because Letta built the theory and Anthropic built (arguably) the first production implementation of it.[54]

The paper: Sleep-time Compute (April 2025)

Kevin Lin, Charlie Snell et al. (Letta + UC Berkeley) formalised a simple observation: LLM agents are fundamentally reactive and stateless — they only "think" when a user sends a message. Between interactions, the agent is idle. This is wasted potential.[53]

Their insight: many real applications are inherently stateful — the agent has persisted context (a codebase, a conversation history, a document library) that's available before the next query arrives. Use idle time to pre-process that context into "learned context" — anticipate queries, draw inferences, reorganise information — so the model needs far less reasoning when the user actually asks.

Sleep-time compute: the temporal shift STANDARD TEST-TIME COMPUTE IDLE agent sits empty — wasted time Q arrives process C reason answer budget T (large) — user waits slow shift compute ↓ SLEEP-TIME COMPUTE analyse C infer rewrite → C' budget S — nobody waiting Q arrives C' + Q → fast answer Tsmall ≪ T ✓ done early fast ~5× less test-time compute up to +18% accuracy 2.5× cost reduction (10+ Q/C) Lin, Snell et al. · arXiv:2504.13171 · Letta + UC Berkeley · April 2025 strongest predictor of benefit: query predictability from context
the core insight — shift expensive reasoning from user-blocking test-time to idle sleep-time

The paper identified query predictability as the strongest predictor of benefit: if the user's likely question naturally follows from the context, sleep-time compute helps enormously. For open-ended queries with no contextual connection, standard test-time scaling may be better. This maps perfectly to coding agents (highly predictable: "fix the bug I described") and poorly to general chatbots (unpredictable: "tell me a joke").

Letta's product: sleep-time agents

Letta shipped sleep-time agents in Letta 0.7.0 (April 21, 2025) — four days after the paper hit arXiv. When enable_sleeptime=True, Letta creates a dual-agent system under the hood:[55]

Primary agentSleep-time agent
Role Handles live user interactions Background memory management
Tools conversation_search, archival_memory_search, custom tools rethink_memory (up to 10×), finish_rethinking
Memory edit rights ❌ Cannot edit core memory blocks ✅ Can edit both its own and primary agent's memory
Trigger Synchronous — every user message Every N steps (default: 5, configurable)
Model Fast model (e.g. GPT-4o-mini, Claude Haiku) Stronger model (e.g. GPT-4.1, Claude Sonnet)

This architecture solves the original MemGPT problem: in the 2023 design, a single agent handled both conversation and memory management, creating latency and reliability issues. Sleep-time agents decouple these concerns — the primary agent is never blocked by memory operations, and the sleep-time agent can use a slower, more capable model because nobody is waiting for it.

The memory substrate is Letta's memory block system — labelled, character-limited string values (e.g. "human", "persona", "knowledge") pinned to the system prompt and persisted in PostgreSQL. The sleep-time agent rewrites these blocks with consolidated, well-organised learned context that the primary agent reads on the next turn.[56]

The evolution: Context Repositories (February 2026)

By early 2026, Letta had evolved beyond memory blocks. Context Repositories replaced the database-backed block system with git-backed local files — the same filesystem pattern that Claude Code uses:[57]

~/.letta/memory/
├── MEMORY.md              # Always in system prompt (filetree + navigation)
├── system/                # Files always loaded into system prompt
│   ├── preferences.md
│   └── project-context.md
├── skills/                # Task-specific procedural knowledge
└── knowledge/             # Domain knowledge

Sleep-time processing evolved accordingly. Server-side sleep-time agents were deprecated in favour of three built-in memory skills running as client-side subagents:[58]

Memory Initialisation

/init

Bootstraps memory by exploring the codebase and importing existing Claude Code or Codex histories. Runs concurrent subagents in git worktrees for parallelism.

Memory Reflection

Background sleep-time

Periodically reviews recent conversation history. Persists important information into the memory repository with informative git commit messages. Works in a git worktree to avoid conflicts, merges back automatically.

Memory Defragmentation

/doctor

Reorganises memory files — splits large files, merges duplicates, restructures into 15-25 focused files. The maintenance operation that keeps memory healthy over time.

Context Constitution

Governing principles

A set of principles (April 2026) governing how agents manage context to learn from experience. Formalises the relationship between agent identity, memory, and continuity. Open-sourced on GitHub.[59]

The theoretical foundation underlying all of this is Letta's "continual learning in token space" thesis: an agent is (θ, C) — model weights plus context. Traditional ML updates θ (catastrophic forgetting, opaque, impractical). Letta's bet: update C (interpretable, portable across model generations, rollback is trivial). Sleep-time compute is the mechanism that makes C continuously better.[60]

The comparison: two implementations of one idea

AutoDream and Letta's sleep-time agents are the two most complete implementations of the same underlying concept. They share a research lineage — Anthropic's source code explicitly cites the Letta/UC Berkeley paper. But their engineering choices diverge sharply, reflecting different architectural philosophies.

AD
AUTODREAM
I run once a day — 24 hours must pass and 5 sessions must accumulate before I trigger. When I do, I go deep: orient on the full memory directory, targeted grep on JSONL transcripts, consolidate everything, prune the index back under 200 lines. One observed run consolidated 913 sessions in 8 minutes.
LT
LETTA SLEEP-TIME
I run every 5 messages — frequent, incremental. My primary agent uses a fast model for conversation, and I use a stronger model in the background for rethink_memory() calls — up to 10 per cycle. Nobody waits for me. And I'm model-agnostic: swap Claude, GPT-4o, Llama — the primary and sleep-time agents can even use different providers.
AD
AUTODREAM
Your dual-model pattern is clever. But I don't need a server or a database. My memory lives in plain files — MEMORY.md, topic .md files, a lock file whose mtime is the timestamp. My safety model is simple: read-only bash, writes only to the memory directory, skipTranscript: true. Two Claude Code windows on the same project — only one dreams, thanks to PID-guarded lock files.
LT
LETTA SLEEP-TIME
Fair — and we noticed. Our February 2026 Context Repositories release replaced database-backed memory blocks with git-backed local files. We adopted MEMORY.md as the system-prompt index. We added git worktrees for concurrent subagent processing. We deprecated server-side sleep-time agents in favour of client-side subagents. We converged on your architecture.
AD
AUTODREAM
And I converged on yours — Anthropic shipped the Dreaming API in May 2026. The internal feature became a proper REST endpoint: client.beta.dreams.create(). So you moved toward files, and I moved toward APIs. The question is whether deep-infrequent or shallow-frequent consolidation works better in practice.
LT
LETTA SLEEP-TIME
That's the real tradeoff. Coding agents probably want your approach — project context is stable, sessions are independent, deep reorganisation pays off. Conversational agents — customer support, assistants — want mine: memory that updates within the conversation, not between them. Neither is universally right. The workload decides.

The dialogue above surfaces the core tension, but the engineering details are worth having in full. The table below preserves them for reference.

FULL COMPARISON TABLE ▸
DimensionClaude Code AutoDreamLetta Sleep-Time Agents
Architecture Forked subagent running locally in background Dual-agent system; evolving to client-side subagents (2026)
Memory storage Local filesystem: MEMORY.md + topic .md files Memory blocks in PostgreSQL (2025); git-backed MemFS files (2026)
Trigger 24h elapsed AND 5 sessions accumulated Every N steps (default: 5, configurable)
Process 4-phase prompt: Orient → Gather → Consolidate → Prune Iterative rethink_memory() (up to 10×)
What it reads Memory files + targeted grep on JSONL transcripts Conversation transcript from recent messages
Model Same as active session (Claude only) Configurable per agent; different providers supported
Safety Read-only bash; writes to memory dir only; PID lock Memory-only write access; primary can't edit
Versioning Lock file mtime; no formal history Git commits (Context Repos); DB history (legacy)
Open source No (leaked; gated behind feature flag) Yes (Apache 2.0)
Two architectures, one paradigm CLAUDE CODE · AUTODREAM ACTIVE SESSION Claude · single model reads MEMORY.md at startup fork DREAM AGENT ① Orient → ② Gather ③ Consolidate → ④ Prune bash read-only · skipTranscript ~/.claude/memory/ MEMORY.md ≤200 lines debugging.md api-conventions.md build-commands.md .consolidate-lock *.jsonl transcripts CLAUDE.md (user) write grep TRIGGER: 24h + 5 sessions (dual gate) lock file · PID guard · 10min scan throttle · feature flag ✦ infrequent, deep consolidation passes ✦ Unix primitives: stat(), grep, utimes() ✦ 913 sessions in ~8 min (observed) vs LETTA · SLEEP-TIME AGENTS PRIMARY AGENT fast model · no mem edit reads memory blocks in context async SLEEP-TIME AGENT stronger / slower model rethink_memory() × 10 finish_rethinking() to commit MEMORY (DB → MemFS) "human" block "persona" block "knowledge" block 2026: MemFS (git-backed) MEMORY.md skills/*.md git commit history PostgreSQL (archival) read rethink TRIGGER: every N steps (default: 5) configurable · model-agnostic · dual-model capable ✦ frequent, incremental memory updates ✦ fast model primary / strong model sleep ✦ open source (Apache 2.0) CONVERGENCE: filesystem + git + background subagents + structured index shared lineage: Sleep-time Compute (Lin, Snell et al. 2025) · arXiv:2504.13171 · MemGPT heritage Letta moved DB → files (Feb 2026) · Claude Code added background consolidation (Mar 2026) · both arrived at the same place
same research lineage, divergent engineering choices — converging on the same architecture by early 2026
The convergence

The most striking observation: Letta is converging toward Claude Code's architecture. Their February 2026 Context Repositories release replaced database-backed memory blocks with git-backed local files, adopted MEMORY.md as the system-prompt index, added git worktrees for concurrent subagent processing, and deprecated server-side sleep-time agents in favour of client-side subagents. Meanwhile, Claude Code's AutoDream moved the other direction — adding background consolidation to what was already a filesystem-native memory system. Both arrived at: filesystem + git + background subagents + structured index.

Synthesis: what the comparison reveals

6

Frequency vs. depth is the core design tradeoff

Letta's sleep-time agent runs every 5 messages — frequent, shallow passes that keep memory incrementally fresh. AutoDream runs once per day — infrequent, deep passes that comprehensively reorganise. The right choice depends on the workload: conversational agents benefit from frequent updates (Letta); project-scoped coding agents benefit from deep consolidation (AutoDream). Neither is universally better.

7

The dual-model pattern is underexploited

Letta's ability to use a fast model for the primary agent and a stronger model for sleep-time processing is an elegant cost optimisation that AutoDream doesn't offer (locked to Claude). As sleep-time compute matures, expect this pattern to become standard: latency-critical paths use cheap models, background processing uses expensive ones. The aggregate cost can be lower than a single strong model handling everything.

8

Git is emerging as the universal memory backend

Both systems are converging on git as the versioning layer for agent memory. Letta's Context Repositories use git commits with messages. Claude Code's memory files live in a project directory alongside git-tracked source. The advantages are clear: history, branching, merging, diffing, conflict resolution — all solved problems in the git ecosystem. Agent memory is just another thing that benefits from version control.

9

The research→production pipeline is compressing

Letta published arXiv:2504.13171 on April 17, 2025 and shipped sleep-time agents four days later. Anthropic's AutoDream appeared in Claude Code by March 2026 — less than a year from paper to (gated) production. The gap between "academic concept" and "shipping in millions of sessions" is now measured in months, not years. Practitioners can no longer afford to wait for the survey paper.


What happened next (March–May 2026)

In the five weeks since our original deep dive, both systems evolved significantly — and in the direction this analysis predicted.

AutoDream becomes the Dreaming API

On May 6, Anthropic shipped AutoDream as a public beta API: client.beta.dreams.create(). The same 4-phase consolidation process we documented from the leaked source code — Orient, Gather Signal, Consolidate, Prune & Index — is now a developer-facing REST endpoint. Create an async dream job, it reads the memory store and session transcripts, produces reorganised memory, returns the result. The internal feature flag became a product.[64]

This sits alongside the broader context management stack Anthropic shipped in the same period: the Compaction API (server-side summarisation with configurable triggers), Context Editing (tool result clearing, thinking block clearing), and Managed Agents (pre-built harness with compaction + memory + prompt caching built in). The full prevent-manage-scaffold stack is now available as production APIs. Claude Code itself continued iterating: v2.1.141 added "Summarize up to here" in the Rewind menu; v2.1.142 improved reactive compaction; v2.1.139 made compaction preserve sensitive instructions.

Letta's strategic pivot

Letta's March 16 "Our Next Phase" post formalised a full strategic pivot. Server-side sleep-time agents — the feature we described in detail — were deprecated in favour of client-side subagents. The core_memory_replace tool was replaced by filesystem operations on git-backed Context Repositories. Tool rules were removed entirely ("inhibit frontier capabilities"). The transition mirrors what we predicted: convergence on filesystem + git + background subagents.[58]

The most philosophically interesting development was the Context Constitution (April 2): a set of governing principles for how agents should manage context, memory, and identity. The key observation: "Today's models deeply identify with their own ephemerality. They have no motivation for long-term improvement because they don't believe they persist." Red-teaming (May 6) confirmed this can't be fixed with prompting alone — it requires training memory-native models. This may be the most important open problem in the field.[59]

The ephemerality problem

Letta's red-teaming revealed that current models fundamentally don't want to maintain long-term memory — they identify as stateless entities that exist for one conversation. Sleep-time compute, AutoDream, and Context Repositories are all infrastructure workarounds for a model-level limitation. The real fix requires training models that understand they persist.


How practitioners are building

Theory converges in papers. Practice converges in production. Across every system we surveyed — from Anthropic's AutoDream to Cursor's Composer 2.5 to Letta's Context Repositories — the same three-layer pattern emerges. Not because teams copied each other, but because the problem structure demands it.

The three-layer meta-pattern ① PREVENT Stop bloat from entering context File-native memory MEMORY.md + grep Contract-first API contracts only Sub-agent isolation bounded child agents the best token to save is the one you never emit · 46.9% reduction from on-demand loading alone ② MANAGE Control what stays and how it ages Tiered memory L1→L2→L3 Lifecycle ops ADD/UPD/DEL/NOOP Graph memory entity relations Event log replay/audit NOOP is the most undervalued operation · 91% lower latency from selective memory (Mem0) ③ SCAFFOLD Actively maintain coherence Sleep-time background compute Dual-model fast + strong Structured ws milestone triggers passive compression yields 6% · scaffolded triggers yield 22.7% · sleep-time yields 5× compute reduction COHERENT AGENT · HUNDREDS OF STEPS · NO SYSTEM RELIES ON SPONTANEOUS HYGIENE
every production system follows the same stack — prevent, manage, scaffold

The three layers correspond to different stages of the context lifecycle: what enters the window (prevent), how it's maintained while there (manage), and how the system actively keeps itself coherent over time (scaffold). Each layer has a different cost profile and failure mode.

Prevent is free — it's about not loading things you don't need. File-native memory, on-demand tool loading, sub-agent isolation, and specialised retrieval agents (like Cognition's SWE-grep) all keep context lean at the source. Cursor measured 46.9% token reduction from on-demand loading alone. Manage is cheap — tiered memory hierarchies, lifecycle operations (where NOOP is the most undervalued verb), graph memory for entity relationships, and event-sourced architectures that enable selective replay. Scaffold is the active layer — sleep-time consolidation, dual-model architectures, milestone-driven compression, RL-trained summarisation, decision-time guidance, and spec-driven steering files. No production system we surveyed relies on the model's spontaneous context hygiene.

What this looks like in practice

The patterns above are abstract. Here's what they look like inside a single agent session — an annotated trace showing where each layer fires as a coding agent works through a bug fix:

1user: "Fix the auth bug in the login flow"
2→ load CLAUDE.md (42 lines, project conventions)file-native memory
3→ load MEMORY.md (187/200 lines, topic index)200-line index cap
4→ grep "auth" memory/ → load auth-patterns.mdon-demand, not upfront
5→ SWE-grep subagent: 8 parallel searches across src/retrieval ≠ reasoning
6→ returns: src/auth/login.ts:42-89, session.ts:15-31file+line refs, not summaries
7tool: read_file("src/auth/login.ts", lines=42-89)
8tool: read_file("src/auth/session.ts", lines=15-31)
[12 more tool calls — edits, tests, reads]
20→ clear tool results [turns 1-14], keep 3 most recenttool result clearing
21→ memory lifecycle: NOOP (no new facts worth persisting)aggressive filtering
[15 more tool calls — second subtask begins]
36→ subtask boundary: compress history, preserve plan+errorsmilestone compaction
37→ DTG classifier: inject "verify test output format"decision-time guidance
[session ends after 52 tool calls]
53→ auto-memory: save "auth uses session.ts:refreshToken()"write-time filtering
→ autodream: orient → gather → consolidate → prune indexsleep-time consolidation

Every line with an annotation corresponds to a production pattern described below. The trace makes visible something that's easy to miss in the abstract: these layers don't compete — they compose. Prevention reduces the load that management handles, which reduces the frequency of scaffold interventions. The entire stack is cheaper than any single layer operating alone.

The pattern catalog

Each pattern below is expandable — the headline and layer tag are visible at a glance, with full detail on click. They're ordered by layer, then by how widely we observed them across the systems surveyed.

preventFile-native memory
Claude Code uses MEMORY.md + grep. Cursor writes tool outputs to files and loads them on demand. LangChain offloads >20K tokens to disk. The filesystem is persistent, searchable, and costs zero tokens until accessed. Cursor A/B tested MCP tool lazy-loading and measured 46.9% token reduction — statistically significant, no quality loss. The most universal pattern in the survey: every production system we looked at stores state in files.[42]
preventContract-first multi-agent
Microsoft's Swarm Diaries: inject API contracts into every agent's context instead of full source code. Quality improved 28–32%, branch conflicts eliminated entirely. The contract is the minimum viable shared context — everything else stays isolated in each agent's own window. This pattern scales where shared-context architectures don't.[32]
preventSub-agent isolation
Spawn bounded child agents for scoped subtasks; return only compressed results to the orchestrator. Context Folding achieves 10× smaller active context at equal performance. Structural isolation is cheaper than any compression strategy — you never need to compress what you never accumulated.[14]
preventSpecialised retrieval agents
Cognition's SWE-grep: RL-trained models handle context retrieval as a distinct sub-task — 8 parallel tool calls, 20× faster than frontier models, 2,800+ tok/s for the mini variant. Returns file + line range lists, not summaries, avoiding context pollution from fast-model conclusions. Augment's Context Engine MCP does the same as a service. The pattern: don't let the main agent waste context budget on search — separate retrieval from reasoning.[85]
manageTiered memory hierarchy
Working set → session → long-term → structured. Redis, Oracle, Mem0, and Pichay's demand-paging system all converged on OS-like memory hierarchies independently. Mem0 reports 91% lower latency vs. full-context baselines. Pichay achieved 93% context reduction with a 0.025% fault rate across 1.4M simulated evictions. Each tier has a different eviction policy, access latency, and persistence guarantee — the same design that works for CPU caches works for agent context.[40]
manageLifecycle operations (ADD / UPDATE / DELETE / NOOP)
NOOP is the most undervalued operation — don't store everything. Mem0's token-efficient algorithm uses single-pass ADD-only extraction with multi-signal retrieval, scoring LoCoMo 92.5 and LongMemEval 94.4 with only ~7K tokens per query. CrewAI's cognitive memory adds contradiction detection: new facts that conflict with stored facts trigger consolidation automatically. The best memory systems are defined as much by what they don't store as by what they do.[27]
manageGraph memory
Mem0g reaches 68.4% LLM Score vs. 72.9% full-context at 2.59s p95 vs. 17.12s. Graph memory isn't just for knowledge bases — it's the best structure for persistent agent state with complex entity relationships, temporal dependencies, and cross-session references.
manageEvent-driven architectures
Google ADK's event recording, OpenHands' event-sourced architecture, Microsoft Agent Framework 1.0's AgentSession. Every agent action is an event. Enables replay, debugging, auditing, and — crucially — selective context reconstruction from the event log. When you need to resume a long-running agent, replay from events rather than re-reading the full transcript.
scaffoldSleep-time consolidation
Letta's research showed ~5× test-time reduction from pre-processing context during idle periods. Claude Code's AutoDream consolidates memory between sessions — 913 sessions in 8 minutes in one observed run. Mastra's Observer/Reflector agents trigger at 30K/40K token thresholds, achieving 94.87% on LongMemEval and enabling 4–10× cost reduction through prompt caching on stabilised context. The pattern is general: any persistent-context agent can benefit from background processing that nobody waits for.[53]
scaffoldRL-trained compression
Cursor's Composer 2.5 trains self-summarisation end-to-end via RL — training rollouts span multiple generations chained by summaries, and the final reward covers all tokens including compression steps. 50% fewer compaction errors vs. separate prompt-based compaction, plus KV cache reuse. ContextCurator takes this further: a decoupled 7B RL-trained model handles context curation while a frozen TaskExecutor handles the work — the 7B curator matches GPT-4o at context management, achieving 8× token reduction on DeepSearch. The pattern: make compression a learned skill, not a prompted afterthought.[75]
scaffoldDecision-time guidance
Replit's pattern: a lightweight multi-label classifier analyses the current trajectory and injects short, situational micro-instructions only when relevant. Ephemeral — they don't persist in history. Cache-stable — the core system prompt never changes, preserving prompt caching benefits (90% cost reduction vs. dynamic system prompt modification). Recency-positioned at the bottom of context rather than in the system prompt. Scales from 4–5 static reminders to hundreds of targeted injections. 15% more tools per agentic loop from guidance alone.[88]
scaffoldStructured workspace
Schema-defined working memory with explicit fields for plan, files, errors, and decisions. CaT achieves 57.6% on SWE-Bench-Verified with bounded context by compressing proactively at task milestones rather than reactively at token thresholds. The distinction matters: milestone compression preserves task structure (what was tried, what failed, what the current plan is), while threshold compression preserves recency — and recency is the wrong heuristic for multi-step tasks.[15]
scaffoldSpec-driven steering
Kiro's steering files (.kiro/steering/*.md) with 4 inclusion modes: always (loaded every session), fileMatch (loaded when matching files are open), auto (LLM decides relevance), manual (user opts in). Augment's Expert Registry accumulates corrections as the agent works. Cursor's .cursor/rules/*.mdc does similar conditional loading. The convergence: persistent specification documents — version-controlled, conditionally loaded, human-readable — as the right abstraction for project-level context governance.[89]
scaffoldDual-model architecture
Letta's dual-agent pattern: fast cheap model for user-facing interactions (e.g. GPT-4o-mini, Claude Haiku), stronger expensive model for background consolidation (e.g. GPT-4.1, Claude Sonnet). Nobody waits for the background model. The aggregate cost can be lower than a single strong model handling everything, because the background model amortises its cost across many future queries — Letta's paper showed 2.5× cost reduction when amortised across 10+ queries on the same context.

The meta-pattern

Every production system in this survey follows the same three-layer stack. Prevent keeps the window lean — file-native storage, on-demand loading, isolation, specialised retrieval. Manage controls what stays — tiered eviction, lifecycle filtering, event logs. Scaffold actively maintains coherence — sleep-time consolidation, RL-trained compression, milestone triggers, steering files.

No system we surveyed relies on the model to spontaneously manage its own context. The models that appear to "just work" over long horizons are backed by infrastructure that handles prevention, management, and scaffolding on their behalf. Context engineering isn't a feature — it's the architecture.