A comprehensive survey of the Feb–May 2026 landscape: what every major lab, framework, and paper says about keeping agents coherent over hundreds of steps.
The conventional wisdom held that context management was about overflow — compress when you hit the wall. That framing is dangerously wrong. Context rot degrades quality continuously, not catastrophically. Even with 1M token windows, agents lose coherence, re-read files they already analyzed, and drift from their original objectives. Attention is a finite budget, and every stale tool result competes with your current task for that budget.[1]
The quality degradation follows a sigmoid: flat until ~30% fill, then accelerating decline. At 60% fill, quality is 70-85% of baseline. At 80%+, near failure. This isn't speculative — Chroma measured it, Factory.ai confirmed it, and every production team we surveyed reported the same shape.[2]
Three compounding mechanisms drive this rot:
Every additional token increases time-to-first-token. A 100K context is 4× slower to prefill than 25K. In agentic loops that run hundreds of turns, this compounds into hours of wasted compute.
Attention is a finite budget. Stale tool results from turn 3 compete with your current task at turn 47 for the same capacity. The "Lost in the Middle" effect is real and gets worse with scale.
Summary drops a file path → agent re-fetches → context grows → needs another summary. Factory.ai measured 10-20× token multiplication from this cycle alone. The dreaded "doom loop."
Flat until ~30% fill, then accelerating decline. At 60% fill, quality is 70-85% of baseline. At 80%+, near failure. The curve is the same across every model and every system.
The industry consensus has shifted from "how do we fit more in?" to "how do we keep what's there high-signal?" Bigger windows don't solve the problem — they delay the symptom while making the underlying rot harder to detect. The 1M token window is a trap if you fill it.
Something remarkable happened in February-March 2026: both OpenAI and Anthropic, working independently, converged on the same three composable primitives for context management. Not the same implementations — the same abstractions. This convergence is the strongest signal we have that the field is finding ground truth.
① COMPACTION
Condense old reasoning while preserving key decisions. OpenAI's server-side compaction triggers at a token threshold. Anthropic's compaction beta does the same. The model's chain of thought is compressed into structured summaries — not discarded, not truncated, but distilled.
openai.compaction · anthropic.compaction_beta
② CLEARING
Drop old tool outputs while keeping the call record. Anthropic's tool-result clearing is the cleanest implementation. The model's reasoning already captured the finding; the raw 3000-line file dump can go. Keep the what was done, discard the what it contained.
anthropic.tool_result_clearing
③ MEMORY
Note-taking to external storage with cross-session persistence. Both OpenAI (RunContextWrapper state) and Anthropic (memory tool) ship this. The agent writes to and reads from a persistent store that survives context resets and session boundaries.
openai.context_wrapper · anthropic.memory_tool
The agent's own reasoning IS the summary. When it reads foo.py and concludes "the bug is on line 247," that conclusion lives in its assistant message. The 2847-line file can be evicted. This is why clearing works — you're not losing information, you're removing redundancy.
The power is in composition. Compaction alone is a blunt instrument. Clearing alone loses too aggressively. Memory alone doesn't solve the within-session problem. But layer all three — clear old tool results, compact the reasoning chain at milestones, persist key decisions to memory — and you get a system that stays coherent over hundreds of steps.[3]
The Feb–April 2026 period saw an unprecedented density of production releases and research reports from every major lab. Not academic speculation — shipped systems, real deployments, concrete numbers.
Shell + Skills + Compaction (Feb 11): Three agentic primitives — Skills (versioned instruction bundles), Shell (containerised exec), server-side compaction (auto-summarise past a token threshold). A 25-hour, ~13M-token Codex run built a design tool by externalising context into repo files (AGENTS.md, Prompt.md, Plan.md) so the agent re-grounds from durable artefacts rather than raw conversation history.[4]
Memory Sources (May 5): GPT-5.5 now pulls context from past chats, saved memories, uploaded files, and Gmail. Cross-conversation memory means the model carries forward knowledge without manual re-prompting. GPT-5.5 Instant is specifically better at cross-chat contextual recall.[61]
Codex CLI 0.132.0 (May 2026): Versioned memory summaries — auto-rebuilt when format becomes stale. Fair skill description trimming within context budget: skills are trimmed proportionally rather than dropping entire skills. The /goal workflow for long-horizon Codex tasks ships structured context management with explicit plan files.[5]
March–May 2026 was Anthropic's most concentrated infrastructure push. They shipped every layer of context management as a production API.
Server-side Compaction API (beta): Configurable automatic summarisation when input tokens exceed a trigger threshold (min 50K, default 150K). Returns a compaction block containing the summary; subsequent requests automatically drop messages before it. pause_after_compaction: true lets developers inject context before resuming. Custom summarisation instructions completely replace the default prompt. 84% token reduction in 100-turn web search evaluation; 29% performance improvement.[62]
Context Editing (expanded): Tool result clearing (clear_tool_uses_20250919) — clears oldest tool call/result pairs, keeping N most recent. Thinking block clearing (clear_thinking_20251015) — manages extended thinking blocks with model-specific defaults. Token counting with context preview for precise budget management.[63]
Dreaming API (public beta, May 6): The productisation of AutoDream. client.beta.dreams.create() launches an async dream job that reads memory store + session transcripts and produces reorganised, consolidated memory. This is the same 4-phase process we detail in §7, now available as a REST API.[64]
Managed Agents (public beta, Apr 1): Pre-built agent harness with built-in compaction, memory tool, and prompt caching. "Outcomes" feature — agent iterates until a goal is met. Multi-agent orchestration built in.[65]
1M context GA (Mar 13): Opus 4.6 and Sonnet 4.6 at standard pricing. Combined with the April 23 postmortem where three context bugs (effort level regression, thinking block clearing, verbosity) were shipped and rapidly fixed — showing how aggressively they're iterating on this surface.[66]
Google Cloud Next (Apr 22–24): The headline launch was Memory Bank + Memory Profiles — agents dynamically generate and curate long-term memories from conversations, recall high-accuracy details with low latency. Production deployments at Gurunavi (proactively presenting options based on past actions) and Payhawk (auto-submitting expenses based on remembered habits, 50%+ time reduction). Custom Session IDs map directly to internal database/CRM records. Multi-day autonomous agent workflows now supported.[67]
Google I/O 2026 (May 19): Gemini 3.5 Flash (agentic model with 1M context), Managed Agents in the Gemini API with persistent isolated environments, Antigravity 2.0 (CLI + desktop app), ADK 2.0 GA with graph-based agent networks. 6+ trillion tokens processed monthly through Gemini models via ADK.[68]
ADK architecture post (May 12): Three shifts — durable memory schemas, event-driven dormancy gates, multi-agent delegation. Key quote: "The fix isn't a bigger context window." ReasoningBank (ICLR 2026) distils from successes AND failures: +8.3% on WebArena, +4.6% SWE-Bench. MaTTS: memory-aware test-time scaling.[69]
HyperAgents (Mar 24): Self-improving agent system where agents invent their own memory structures by generation 3. No human-designed memory architecture — the system discovers what to store and how to organise it through evolutionary self-improvement. Published as a research paper with striking implications for the "should we design memory or let agents learn it?" debate.[70]
Muse Spark (Apr 8): Natively multimodal agent from Meta Superintelligence Labs. Contemplating Mode uses multi-agent parallelism — the agent spawns multiple reasoning threads and synthesises. Thought Compression is RL-trained compression of internal reasoning, reducing token overhead while preserving quality. Not open-weight (a departure for Meta), limited to private preview.[71]
Llama 4 Scout: 10M token context window via iRoPE architecture — the largest production context window to date. Llama 4 Maverick: MoE architecture at standard context lengths. Both open-weight, continuing Meta's approach of pushing the frontier on raw context capacity while leaving memory management to the ecosystem.
Six eras, each building on the last. Diagrams on the left, story on the right.
The first breakthrough wasn't algorithmic — it was architectural. Push state into the filesystem. AGENTS.md, CLAUDE.md, CHANGELOG.md, Plan.md. The agent's memory IS the codebase. Git commits become the long-term memory. Markdown files become the working memory.[6]
OpenAI's Codex proved it: 13M tokens across 25 hours, zero overflow, zero coherence loss. Anthropic's long-running Claude confirmed it independently with multi-day scientific computing runs.[7]
13M tokens, 25 hours, zero overflowBoth OpenAI and Anthropic converged on the same three composable primitives, working independently. Compaction for old reasoning. Clearing for stale tool outputs. Memory for cross-session persistence. The convergence is the signal — when two competing labs arrive at the same abstractions, you're close to ground truth.
Anthropic's cookbook formalized the composition: clear first (free), compact second (cheap), persist to memory third (durable). Each addresses a different failure mode.[3]
three problems → three mechanismsA different philosophy emerged: instead of compressing a bloated context, prevent bloat by spawning bounded child agents. Each starts with a clean slate, works on a scoped task, and returns a compressed result. The parent never accumulates the raw work.
Cognition's multi-Devin[13] pioneered this. Microsoft's contract-first swarms eliminated branch conflicts entirely. Context Folding (ICLR 2026)[14] formalized it: branch/fold with FoldPO training achieves 10× smaller active context matching baseline performance.
10× smaller active contextThe CaT paper[15] reframed context management as a callable tool. The agent maintains a structured workspace (plan, files, errors, decisions) and proactively compresses it at milestones. Not passive — actively managed. Not freeform — structured fields that prevent silent detail loss.
57.6% on SWE-Bench-Verified with bounded context. The structured workspace outperformed systems with unbounded windows because the workspace format prevented attention dilution.
57.6% SWE-Bench-VerifiedDatabricks made a startling discovery[16]: agent performance improves with accumulated experience, following scaling laws analogous to model size scaling. Accuracy jumped from 2.5% to 85%. Reasoning steps dropped from 20 to 5. And uncurated logs filtered by LLM judges surpassed expert-curated baselines.
Memory becomes a new scaling axis alongside model size and context length. You can make agents dramatically better without training a bigger model — just give them more experience to draw from.
2.5% → 85% accuracy with memory scalingMemex(RL)[17] overturned a key assumption. After RL training, agents compress 3× less but retrieve 7× more. Task success: 24% → 86%. The optimal strategy isn't aggressive compression — it's building a high-quality index once, then dereferencing it repeatedly.
KLong[18] complemented this with trajectory-splitting SFT + progressive RL. A 106B model surpassed a 1T model by 11.28% on long-horizon tasks. The message: you don't need a bigger model — you need better memory management training.
compress less, retrieve moreThe patterns that existed only as internal implementations — AutoDream, context clearing, managed compaction — shipped as production APIs within weeks of each other. Anthropic released the Compaction API (configurable triggers, pause-after injection, custom summarisation), the Dreaming API (async memory consolidation via REST), and Managed Agents (pre-built harness with all primitives composed).[64]
Google followed at I/O 2026 with Memory Bank + Memory Profiles, Managed Agents in the Gemini API, and ADK 2.0 GA. Microsoft shipped Agent Framework 1.0 unifying AutoGen and Semantic Kernel. The message: context management isn't a research problem anymore — it's infrastructure that ships as product.
research → API in under 12 monthsContextCurator[74] proved a 7B RL-trained model matches GPT-4o at context management by decoupling curation from execution. The curator decides what enters and leaves context; a separate frozen executor does the actual work. 8× token reduction on DeepSearch benchmarks. This is Era 6's insight taken to its logical conclusion: don't just train the agent to retrieve better — train a separate model whose only job is context quality.
Cursor's Composer 2.5[75] applied RL directly to self-summarisation — training rollouts span multiple generations chained by summaries, so the reward covers all tokens including compression. 50% fewer compaction errors. Cognition's SWE-grep[85] did the same for retrieval: RL-trained models running 8 parallel tool calls, 20× faster than frontier models, returning file+line refs only. The pattern: context management is a learnable skill that can be trained independently.
7B curator = GPT-4o at context managementThe Feb–May 2026 window produced an extraordinary density of research. We tracked 70+ papers. These are the ones that matter, organized by theme.
| Paper | Date | Key Finding | Link |
|---|---|---|---|
| Context as a Tool (CaT) | Mar 2026 | Structured workspace as a callable tool. 57.6% on SWE-Bench-Verified with bounded context. Proactive compression at milestones outperforms reactive compression at thresholds. | [15] |
| Context Folding (ICLR 2026) | Feb 2026 | Branch/fold architecture with FoldPO training. 10× smaller active context matching baseline performance. Isolation by construction, not compression. | [14] |
| AgentSwing | Mar 2026 | Parallel context management routing. 3× fewer turns to task completion by routing subtasks to specialised context handlers simultaneously. | [19] |
| Inside the Scaffold | Mar 2026 | Taxonomy of 13 production coding agent scaffolds. Found that context management strategy is the strongest differentiator, not model choice. | [20] |
| SWE-Pruner | Mar 2026 | 0.6B neural skimmer that identifies and removes irrelevant context. 23-38% token reduction with zero accuracy loss. Cheap enough to run on every turn. | [21] |
| Pichay: Missing Memory Hierarchy | Mar 2026 | Demand paging for LLMs — treats the context window as L1 cache. Evicts stale content, detects page faults on re-request, pins working-set pages. Up to 93% context reduction (5,038KB → 339KB). 0.025% fault rate across 1.4M evictions. | [72] |
| LongSeeker / Context-ReAct | May 2026 | 5 atomic context operations (Skip, Compress, Rollback, Snippet, Delete). Proves Compress is expressively complete. Fine-tuned from Qwen3-30B on 10K trajectories. 61.5% BrowseComp, 62.5% BrowseComp-ZH (vs. Tongyi: 43.2%/46.7%). | [73] |
| ContextCurator | Apr 2026 | Decoupled RL architecture: 7B ContextCurator manages context, frozen TaskExecutor does the work. 7B matches GPT-4o at context curation. DeepSearch: 57.1% with 8× fewer tokens. Proof that curation is separable from capability. | [74] |
| Composer 2.5 (Cursor) | Mar 2026 | End-to-end RL self-summarisation. Reward spans compression steps across chained generations. 50% fewer compaction errors, KV cache reuse, 5K→1K token compression. 61.7 Terminal-Bench 2.0 at ~1/30th Opus 4.6 cost. | [75] |
| Paper | Date | Key Finding | Link |
|---|---|---|---|
| HiMem | Mar 2026 | Cognitive-inspired hierarchical memory: sensory → working → long-term. Forgetting curve for automatic demotion. Outperforms flat memory by 18-24% on multi-session tasks. | [22] |
| LightMem | Mar 2026 | Three-tier STM/MTM/LTM with small language models. 83ms p95 latency. Proves you don't need frontier models for memory management — SLMs suffice. | [23] |
| MemRL | Feb 2026 | RL on episodic memory without fine-tuning the base model. Memory policy learned as a separate module. Clean separation of memory management from task execution. | [24] |
| ML-Master 2.0 | Mar 2026 | Hierarchical Cognitive Caching: L1 (hot working set) → L2 (warm compressed) → L3 (cold archived). OS-inspired cache hierarchy for agent memory. | [25] |
| Memex(RL) | Mar 2026 | Indexed experience memory. After RL: compress 3× less, retrieve 7× more. Success 24%→86%. The definitive evidence that index quality > compression frequency. | [17] |
| Memini | May 2026 | Benna-Fusi synaptic consolidation dynamics for agent memory — episodic sensitivity for recent events with gradual consolidation into stable long-term representations. First principled bridge between computational neuroscience and LLM agent memory. | [76] |
| STALE | May 2026 | Memory validity benchmark: 400 expert-validated conflict scenarios, 1,200 queries, up to 150K tokens. Best model: only 55.2% at detecting implicit conflicts. Introduces CUPMem prototype for write-time state consolidation. | [77] |
| Mastra Observational Memory | Feb 2026 | Observer + Reflector background agents. Observer at 30K tokens, Reflector at 40K. 94.87% LongMemEval. Stable context enables prompt caching: 4–10× cost reduction vs. dynamic retrieval. | [78] |
| CrewAI Cognitive Memory | Mar 2026 | Five cognitive operations: remember/recall/extract/tree/forget. Built-in contradiction detection — conflicting facts trigger auto-consolidation. Self-organising hierarchy, composite recall scoring. | [79] |
| Paper | Date | Key Finding | Link |
|---|---|---|---|
| Memory in the LLM Era | Apr 2 | Unified benchmark of all memory methods. Found that no single method dominates — the best systems compose 2-3 complementary approaches. | [26] |
| Memory for Autonomous Agents | Mar 8 | Formalised the write-manage-read loop. NOOP (don't store) is the most undervalued operation — most agent memory systems store too much. | [27] |
| CE: Prompts to Corporate | Mar 10 | Context engineering as a standalone discipline. Argues CE should be a role, not a task. Maps the full lifecycle from prompt design to production monitoring. | [28] |
| Agentic RAG Survey | Apr 1 | Taxonomy by agent cardinality (single vs multi) and control (reactive vs proactive). Proactive multi-agent RAG outperforms all other configurations by 15-30%. | [29] |
| From Storage to Experience (ACL 2026) | May 2026 | Evolutionary framework: Storage (trajectory preservation) → Reflection (refinement) → Experience (abstraction). Explores proactive exploration and cross-trajectory skill transfer. | [80] |
| Token Economics for LLM Agents | May 2026 | First survey unifying CS and economics around tokens. 4D taxonomy: Micro (single agent), Meso (multi-agent), Macro (ecosystem), Security. Tokens as production factors, exchange mediums, units of account. | [81] |
| Rethinking Memory in the LLM Era | May 2026 | Unified framework: Extract → Manage → Store → Retrieve. 6-operation management taxonomy. First systematic comparison of all memory methods under identical experimental conditions. | [82] |
| Paper | Date | Key Finding | Link |
|---|---|---|---|
| Multi-Agent Memory Architecture | Mar 2026 | Computer architecture perspective on agent memory. Applies cache coherence protocols (MESI-like) to multi-agent shared state. Eliminates stale reads between agents. | [30] |
| M2CL | Feb 2026 | Dynamic per-agent context instructions. Each agent gets a personalized context view based on its role and current task. 20-50% improvement over shared context. | [31] |
| Swarm Diaries | Mar 2026 | Contract-first planning: inject API contracts into every agent's context. Branch conflicts dropped from 50% to 0%. Quality improved 28-32%. | [32] |
| Shared Context Graphs | Mar 2026 | Decentralised knowledge graphs for agent teams. Each agent maintains a local subgraph and synchronises deltas. No central bottleneck for context sharing. | [33] |
| Multi-Agent Memory via CS Architecture | Mar 2026 | Analyses multi-agent memory through cache coherence protocols, write-back/write-through policies, bus snooping, MOESI states. Multi-agent consistency is solved in hardware — the question is which protocols transfer. | [83] |
| SSGM: Safety in Evolving Memory | Mar 2026 | Agents with self-modifying memories can amplify biases, hallucinate persistent false beliefs, or drift from safety constraints. Framework for safe self-governing memory with verification checkpoints. | [84] |
| SWE-grep (Cognition) | Apr 2026 | RL-trained parallel retrieval: 8 tool calls/turn, 4 turns max. 650 tok/s (mini: 2,800+). Returns file+line ranges, not summaries. F-β reward, per-sequence importance sampling. Available as "Fast Context" in Windsurf. | [85] |
| Benchmark | Date | Key Finding | Link |
|---|---|---|---|
| YC-Bench | Mar 2026 | POMDP startup simulation. Scratchpad usage is the strongest predictor of success — agents that externalise state outperform agents with better reasoning. | [34] |
| Jenova | Mar 2026 | 31 non-coding workflows at 100K+ tokens. Reasoning leaders ≠ orchestration leaders. Models that top coding benchmarks fail at multi-step orchestration. | [35] |
| StructMemEval | Mar 2026 | Tests memory organisation, not just recall. Simple RAG fails — structured memory with hierarchical organisation required for sustained performance. | [36] |
| AMA-Bench | Apr 2026 | Agentic memory evaluation. GPT-5.2 achieves only 72.26%. Even frontier models struggle with sustained memory management over long horizons. | [37] |
| SlopCodeBench | Mar 2026 | Agents produce 2.2× more verbose code than humans. 0/20 end-to-end tasks solved. Context bloat from verbose generation is a self-inflicted wound. | [38] |
| EvoClaw | Mar 2026 | Continuous evolution benchmark: >80% on isolated tasks → ≤38% sustained. The devastating gap between one-shot and continuous performance. | [39] |
| HORIZON | Apr 2026 | 3,100+ trajectories for long-horizon failure diagnosis. Separates failure modes: context overflow, reasoning drift, instruction forgetting, tool misuse, planning collapse. Diagnostic labels, not just pass/fail. | [86] |
| STALE: Memory Validity | May 2026 | 400 expert scenarios, 1,200 queries, 150K tokens. Tests whether agents detect when stored memories are invalidated by later observations. Best model: 55.2% — barely better than chance on implicit conflicts. | [77] |
| Mem0 Benchmark Suite | Apr 2026 | Token-efficient algo: LoCoMo 92.5, LongMemEval 94.4, BEAM 1M 64.1, BEAM 10M 48.6 — with ~7K tokens/query (vs. ~26K full-context). +29.6 temporal reasoning, +23.1 multi-hop. BEAM 10M drops 25% from 1M. | [87] |
Across 70+ papers, every major lab release, and every production system we surveyed, twelve patterns keep appearing. These aren't speculative — they're empirically validated by multiple independent teams.
Not just at overflow. Sigmoid degradation starts at ~30% fill. Even 1M windows exhibit it — bigger windows delay the symptom, they don't cure the disease. Mem0 shows selective memory achieves 91% lower latency with only a 6-point accuracy tradeoff.[40] The tradeoff is overwhelmingly worth it.
Compaction + Clearing + Memory. OpenAI and Anthropic converged independently. Factory.ai scored 3.70/5 on structured compression vs Anthropic's 3.44 and OpenAI's 3.35.[41] The primitives are the same; the implementations differ; the convergence is the signal.
AGENTS.md, CLAUDE.md, CHANGELOG.md. The agent's memory IS the codebase. OpenAI's 25-hour Codex run[6] proved this works at scale. Anthropic's multi-day scientific compute runs[7] confirmed it independently. Git becomes long-term memory for free.
Context Folding: 10× smaller active context matching baseline.[14] Cognition's multi-Devin.[13] Microsoft's contract-first swarms reduced branch conflicts from 50% to 0%.[32] Structural isolation is cheaper and more reliable than compression.
Claude Code's leaked architecture: grep + MEMORY.md, no vector DB. Long context + lexical search is viable. Cursor's dynamic discovery[42] confirmed it — put names in the prompt, put bodies in files, let the agent grep. 46.9% token reduction from this alone.
Focus found 6% savings.[43] LangChain saw zero compressions in benchmarks.[44] You need aggressive scaffolding: system reminders every 10-15 tool calls. Replit's approach — classifier-gated micro-instructions — works. Pure "you can compress whenever" prompting does not.
Databricks: 2.5%→85% accuracy with memory scaling.[16] Uncurated logs filtered by LLM judges surpass expert baselines. You don't need a bigger model — you need a better memory system. This changes the economics of agent improvement.
Memex(RL): after RL training, compress 3× less but retrieve 7× more. Success: 24%→86%.[17] Optimise for index quality, not compression frequency. A well-structured indexed summary with 20 named entries is worth more than 10 mediocre freeform summaries.
>80% on isolated tasks → ≤38% sustained (EvoClaw).[39] Scratchpad usage is the strongest predictor (YC-Bench).[34] Reasoning leaders ≠ orchestration leaders (Jenova).[35] Our benchmarks are measuring the wrong thing — one-shot performance tells you almost nothing about sustained coherence.
Kiro's .kiro/steering/*.md with 4 inclusion modes. Augment's Expert Registry. Cursor's .cursor/rules/*.mdc. Claude Code's CLAUDE.md. OpenAI's AGENTS.md. Every production coding agent now has a persistent, version-controlled specification file that governs context loading. The abstraction is the same everywhere — conditional inclusion of human-written guidelines — even though nobody standardised it. The convergence suggests this is a natural primitive, not a design choice.
ContextCurator proved a 7B model matches GPT-4o at deciding what enters and leaves context.[74] Cognition's SWE-grep separates retrieval from reasoning with RL-trained small models.[85] Cursor's Composer 2.5 trains summarisation end-to-end via RL. The pattern: context management doesn't require a frontier model — it requires a trained model. A small model with the right reward signal outperforms a large model with a prompted afterthought.
Letta's red-teaming of their Context Constitution revealed that current models fundamentally identify with their own ephemerality — they don't believe they persist, so they have no motivation for long-term memory maintenance. This can't be fixed with prompting alone.[59] AutoDream, Context Repositories, and every scaffold in this report are infrastructure workarounds for a model-level limitation. The real fix — training models that understand they persist — is the next frontier. STALE's finding that the best model achieves only 55.2% at detecting stale memories[77] underscores how far we are from solving it.
Every pattern in this report — externalised state, tiered memory, lifecycle management, consolidation — converges in one real system. Claude Code's AutoDream is the most detailed production implementation of agent memory consolidation we can study, thanks to an accidental source code leak in March 2026. It's a case study in how the theory maps to engineering.[45]
The name is deliberate. Just as the human brain consolidates memories during REM sleep — pruning irrelevant connections, strengthening important ones, converting episodic memories to semantic knowledge — AutoDream runs between sessions to consolidate the agent's accumulated notes into durable, well-organised memories.[46]
Claude Code's Auto Memory system (shipped in v2.1.59, Feb 26 2026) automatically saves notes during sessions — build commands, debugging patterns, architecture decisions, user preferences — into ~/.claude/projects/<project>/memory/MEMORY.md and topic files. After 20-30+ sessions, these notes rot:
The rot has four overlapping failure modes. Temporal decay: "yesterday we decided to use Redis" is meaningless two weeks later; without absolute dates, temporal context collapses. Contradiction accumulation: "API uses Express" sits alongside "migrated to Fastify" — the agent can't distinguish which is current, and contradictions compound silently until behaviour becomes unpredictable. Reference rot: debugging notes reference files deleted in a refactor, build commands point to renamed scripts, the memory becomes a map to a territory that no longer exists. And index overflow: only the first 200 lines / 25KB of MEMORY.md are loaded at session start — past that threshold, newer memories push older ones below the fold, invisible to the agent. The most recent information drowns the most important.
None of these failures are dramatic. They don't crash the agent. They degrade it slowly — each session slightly worse than the last, with no clear inflection point. By the time someone notices, the memory is already a liability. AutoDream exists to run the maintenance nobody remembers to do.
AutoDream is the fourth layer in Claude Code's memory system. Each layer serves a different temporal scope and persistence model:[47]
| Layer | Written by | When | Scope | Loaded at startup |
|---|---|---|---|---|
| CLAUDE.md | User (manual) | Edited by hand | Project / user / org | Full file, every session |
| Auto Memory | Claude (per session) | During each session | Per project | First 200 lines / 25KB of MEMORY.md |
| Session Memory | Claude (automatic) | Every ~5K tokens | Per session | Relevant past sessions |
| AutoDream | Claude (periodic) | Every 24h + 5 sessions | Per project | N/A — runs between sessions |
The implementation lives in four TypeScript files discovered in the source code leak of March 31, 2026 (v2.1.88 shipped with a 59.8MB source map):[48]
| File | Purpose | Size |
|---|---|---|
autoDream.ts | Orchestrator — gate checks, forked agent launch, analytics | 324 lines |
consolidationPrompt.ts | Builds the 4-phase memory consolidation prompt | 65 lines |
consolidationLock.ts | Lock file management, lastConsolidatedAt via mtime | 140 lines |
config.ts | Reads autoDreamEnabled setting + GrowthBook flag | 21 lines |
When all gates pass, AutoDream spawns a background subagent with a structured 4-phase prompt. The prompt was extracted verbatim from consolidationPrompt.ts:[49]
PHASE 1 — ORIENT
ls the memory directory. Read MEMORY.md to understand the current index. Skim existing topic files so you improve them rather than creating duplicates. If logs/ or sessions/ subdirectories exist, review recent entries. Build a mental map before changing anything.
PHASE 2 — GATHER SIGNAL
Search for new information worth persisting. Priority: daily logs first, then drifted memories, then targeted transcript grep. Never read full transcripts. Use grep -rn "<narrow term>" --include="*.jsonl" | tail -50. Look for user corrections, explicit saves, recurring themes, and architectural decisions.
PHASE 3 — CONSOLIDATE
Merge new signal into existing topic files — never create near-duplicates. Convert relative dates to absolute: "yesterday" → "2026-03-24". Delete contradicted facts at the source — if the project migrated from Express to Fastify, fix the memory, don't flag it.
PHASE 4 — PRUNE & INDEX
Update MEMORY.md to stay under 200 lines AND ~25KB. It's an index, not a dump — each entry one line, ~150 chars: - [Title](file.md) — one-line hook. Remove stale pointers. Demote verbose entries to topic files. Resolve contradictions between files.
Safety constraint (background runs only): Bash restricted to read-only commands — ls, find, grep, cat, stat, wc, head, tail. File writes allowed only within the memory directory. Source code is untouchable. skipTranscript: true — dream conversations are never persisted.
The lock mechanism is a small piece of systems engineering worth studying. The file .consolidate-lock in the memory directory serves double duty:[50]
// Lock file whose mtime IS lastConsolidatedAt. Body is the holder's PID. // Stale past HOLDER_STALE_MS even if the PID is live (PID reuse guard). const LOCK_FILE = '.consolidate-lock' const HOLDER_STALE_MS = 60 * 60 * 1000 // 1 hour // Acquire: write PID → read back → verify PID matches → proceed // Release on success: mtime advances to "now" automatically // Rollback on failure: utimes() rewinds mtime to pre-acquire value // Stale detection: lock >1hr old AND PID dead → reclaim
No separate timestamp file, no database, no distributed lock service. The filesystem's own mtime is the state. Failure recovery rewinds the clock so the next attempt behaves as if nothing happened. Two Claude Code windows on the same project — only one dreams.
AutoDream is not context compaction (/compact). Compaction operates on the live session context window — summarising conversation history when it approaches capacity. AutoDream operates on persistent memory files between sessions — consolidating accumulated notes into durable, well-organised memories. They are complementary layers targeting different failure modes.
| AutoDream | Context Compaction (/compact) | |
|---|---|---|
| Operates on | MEMORY.md + topic files |
Live conversation history |
| When it runs | Between sessions (background) | During a session (~95% context capacity) |
| Trigger | 24h elapsed + 5 sessions accumulated | Token threshold or manual /compact |
| Output | Reorganised, pruned memory files | Condensed conversation summary |
| Blocks user | Never — runs as forked subagent | Briefly — replaces conversation context |
| Config | autoDreamEnabled in settings |
autoCompactEnabled or /compact |
AutoDream was never formally announced. It first appeared as a toggle in the /memory UI around v2.1.83 (March 25, 2026), the same release that added the 25KB MEMORY.md truncation limit. The feature is gated behind the server-side GrowthBook flag tengu_onyx_plover — Anthropic controls rollout globally, and users cannot force-enable it locally.[51]
Community discovery came from multiple angles: Reddit users spotted the /memory toggle, researchers ran strings on the Mach-O binary, and the March 31 source map leak exposed the full implementation. GitHub issues piled up — /dream appeared in the UI but returned "Unknown skill: dream" for users without the flag (#38461, #39135).
The dream agent uses targeted grep on JSONL transcripts — never exhaustive reads. This makes it efficient even across hundreds of sessions. One observed run consolidated 913 sessions in ~8-9 minutes. Lexical search over structured files is the consistent pattern across both Claude Code's daily operation and its background maintenance.[52]
The 200-line / 25KB startup load limit on MEMORY.md is the hard constraint that shapes everything. It forces the index-plus-topic-files pattern: a lean index that points to detailed topic files. AutoDream's primary job is keeping the index under this threshold while preserving coverage. Topic files hold the detail; the index holds the map.
AutoDream is arguably the first production implementation of the Sleep-time Compute concept (arXiv:2504.13171, Lin et al. 2025). That paper showed pre-computing during idle time reduces test-time compute by ~5× at equal accuracy. AutoDream applies the principle: the agent's "sleep" time between sessions is useful compute for memory maintenance, not dead time.[53]
AutoDream was extracted from a larger system called KAIROS — Anthropic's unreleased autonomous daemon mode (190+ references across 61 files in the leaked source). The comment in the source: "Extracted from dream.ts so auto-dream ships independently of KAIROS feature flags." KAIROS has its own disk-skill dream implementation. When KAIROS mode is active, AutoDream is bypassed: if (getKairosActive()) return false.
From the top of this report to the bottom, the same pattern recurs: the filesystem is the memory system. CLAUDE.md for instructions. MEMORY.md for the index. Topic files for detail. JSONL transcripts for raw history. Lock files for coordination. No vector database, no external service, no complex infrastructure. stat(), grep, utimes(). Unix as the memory layer.
AutoDream doesn't exist in a vacuum — it traces directly to research from Letta (formerly MemGPT), the UC Berkeley spinout that coined "sleep-time compute" as a formal paradigm. Understanding Letta's approach is essential context for understanding AutoDream, because Letta built the theory and Anthropic built (arguably) the first production implementation of it.[54]
Kevin Lin, Charlie Snell et al. (Letta + UC Berkeley) formalised a simple observation: LLM agents are fundamentally reactive and stateless — they only "think" when a user sends a message. Between interactions, the agent is idle. This is wasted potential.[53]
Their insight: many real applications are inherently stateful — the agent has persisted context (a codebase, a conversation history, a document library) that's available before the next query arrives. Use idle time to pre-process that context into "learned context" — anticipate queries, draw inferences, reorganise information — so the model needs far less reasoning when the user actually asks.
The paper identified query predictability as the strongest predictor of benefit: if the user's likely question naturally follows from the context, sleep-time compute helps enormously. For open-ended queries with no contextual connection, standard test-time scaling may be better. This maps perfectly to coding agents (highly predictable: "fix the bug I described") and poorly to general chatbots (unpredictable: "tell me a joke").
Letta shipped sleep-time agents in Letta 0.7.0 (April 21, 2025) — four days after the paper hit arXiv. When enable_sleeptime=True, Letta creates a dual-agent system under the hood:[55]
| Primary agent | Sleep-time agent | |
|---|---|---|
| Role | Handles live user interactions | Background memory management |
| Tools | conversation_search, archival_memory_search, custom tools |
rethink_memory (up to 10×), finish_rethinking |
| Memory edit rights | ❌ Cannot edit core memory blocks | ✅ Can edit both its own and primary agent's memory |
| Trigger | Synchronous — every user message | Every N steps (default: 5, configurable) |
| Model | Fast model (e.g. GPT-4o-mini, Claude Haiku) | Stronger model (e.g. GPT-4.1, Claude Sonnet) |
This architecture solves the original MemGPT problem: in the 2023 design, a single agent handled both conversation and memory management, creating latency and reliability issues. Sleep-time agents decouple these concerns — the primary agent is never blocked by memory operations, and the sleep-time agent can use a slower, more capable model because nobody is waiting for it.
The memory substrate is Letta's memory block system — labelled, character-limited string values (e.g. "human", "persona", "knowledge") pinned to the system prompt and persisted in PostgreSQL. The sleep-time agent rewrites these blocks with consolidated, well-organised learned context that the primary agent reads on the next turn.[56]
By early 2026, Letta had evolved beyond memory blocks. Context Repositories replaced the database-backed block system with git-backed local files — the same filesystem pattern that Claude Code uses:[57]
~/.letta/memory/ ├── MEMORY.md # Always in system prompt (filetree + navigation) ├── system/ # Files always loaded into system prompt │ ├── preferences.md │ └── project-context.md ├── skills/ # Task-specific procedural knowledge └── knowledge/ # Domain knowledge
Sleep-time processing evolved accordingly. Server-side sleep-time agents were deprecated in favour of three built-in memory skills running as client-side subagents:[58]
/initBootstraps memory by exploring the codebase and importing existing Claude Code or Codex histories. Runs concurrent subagents in git worktrees for parallelism.
Periodically reviews recent conversation history. Persists important information into the memory repository with informative git commit messages. Works in a git worktree to avoid conflicts, merges back automatically.
/doctorReorganises memory files — splits large files, merges duplicates, restructures into 15-25 focused files. The maintenance operation that keeps memory healthy over time.
A set of principles (April 2026) governing how agents manage context to learn from experience. Formalises the relationship between agent identity, memory, and continuity. Open-sourced on GitHub.[59]
The theoretical foundation underlying all of this is Letta's "continual learning in token space" thesis: an agent is (θ, C) — model weights plus context. Traditional ML updates θ (catastrophic forgetting, opaque, impractical). Letta's bet: update C (interpretable, portable across model generations, rollback is trivial). Sleep-time compute is the mechanism that makes C continuously better.[60]
AutoDream and Letta's sleep-time agents are the two most complete implementations of the same underlying concept. They share a research lineage — Anthropic's source code explicitly cites the Letta/UC Berkeley paper. But their engineering choices diverge sharply, reflecting different architectural philosophies.
grep on JSONL transcripts, consolidate everything, prune the index back under 200 lines. One observed run consolidated 913 sessions in 8 minutes.rethink_memory() calls — up to 10 per cycle. Nobody waits for me. And I'm model-agnostic: swap Claude, GPT-4o, Llama — the primary and sleep-time agents can even use different providers.MEMORY.md, topic .md files, a lock file whose mtime is the timestamp. My safety model is simple: read-only bash, writes only to the memory directory, skipTranscript: true. Two Claude Code windows on the same project — only one dreams, thanks to PID-guarded lock files.MEMORY.md as the system-prompt index. We added git worktrees for concurrent subagent processing. We deprecated server-side sleep-time agents in favour of client-side subagents. We converged on your architecture.client.beta.dreams.create(). So you moved toward files, and I moved toward APIs. The question is whether deep-infrequent or shallow-frequent consolidation works better in practice.The dialogue above surfaces the core tension, but the engineering details are worth having in full. The table below preserves them for reference.
| Dimension | Claude Code AutoDream | Letta Sleep-Time Agents |
|---|---|---|
| Architecture | Forked subagent running locally in background | Dual-agent system; evolving to client-side subagents (2026) |
| Memory storage | Local filesystem: MEMORY.md + topic .md files |
Memory blocks in PostgreSQL (2025); git-backed MemFS files (2026) |
| Trigger | 24h elapsed AND 5 sessions accumulated | Every N steps (default: 5, configurable) |
| Process | 4-phase prompt: Orient → Gather → Consolidate → Prune | Iterative rethink_memory() (up to 10×) |
| What it reads | Memory files + targeted grep on JSONL transcripts |
Conversation transcript from recent messages |
| Model | Same as active session (Claude only) | Configurable per agent; different providers supported |
| Safety | Read-only bash; writes to memory dir only; PID lock | Memory-only write access; primary can't edit |
| Versioning | Lock file mtime; no formal history | Git commits (Context Repos); DB history (legacy) |
| Open source | No (leaked; gated behind feature flag) | Yes (Apache 2.0) |
The most striking observation: Letta is converging toward Claude Code's architecture. Their February 2026 Context Repositories release replaced database-backed memory blocks with git-backed local files, adopted MEMORY.md as the system-prompt index, added git worktrees for concurrent subagent processing, and deprecated server-side sleep-time agents in favour of client-side subagents. Meanwhile, Claude Code's AutoDream moved the other direction — adding background consolidation to what was already a filesystem-native memory system. Both arrived at: filesystem + git + background subagents + structured index.
Letta's sleep-time agent runs every 5 messages — frequent, shallow passes that keep memory incrementally fresh. AutoDream runs once per day — infrequent, deep passes that comprehensively reorganise. The right choice depends on the workload: conversational agents benefit from frequent updates (Letta); project-scoped coding agents benefit from deep consolidation (AutoDream). Neither is universally better.
Letta's ability to use a fast model for the primary agent and a stronger model for sleep-time processing is an elegant cost optimisation that AutoDream doesn't offer (locked to Claude). As sleep-time compute matures, expect this pattern to become standard: latency-critical paths use cheap models, background processing uses expensive ones. The aggregate cost can be lower than a single strong model handling everything.
Both systems are converging on git as the versioning layer for agent memory. Letta's Context Repositories use git commits with messages. Claude Code's memory files live in a project directory alongside git-tracked source. The advantages are clear: history, branching, merging, diffing, conflict resolution — all solved problems in the git ecosystem. Agent memory is just another thing that benefits from version control.
Letta published arXiv:2504.13171 on April 17, 2025 and shipped sleep-time agents four days later. Anthropic's AutoDream appeared in Claude Code by March 2026 — less than a year from paper to (gated) production. The gap between "academic concept" and "shipping in millions of sessions" is now measured in months, not years. Practitioners can no longer afford to wait for the survey paper.
In the five weeks since our original deep dive, both systems evolved significantly — and in the direction this analysis predicted.
On May 6, Anthropic shipped AutoDream as a public beta API: client.beta.dreams.create(). The same 4-phase consolidation process we documented from the leaked source code — Orient, Gather Signal, Consolidate, Prune & Index — is now a developer-facing REST endpoint. Create an async dream job, it reads the memory store and session transcripts, produces reorganised memory, returns the result. The internal feature flag became a product.[64]
This sits alongside the broader context management stack Anthropic shipped in the same period: the Compaction API (server-side summarisation with configurable triggers), Context Editing (tool result clearing, thinking block clearing), and Managed Agents (pre-built harness with compaction + memory + prompt caching built in). The full prevent-manage-scaffold stack is now available as production APIs. Claude Code itself continued iterating: v2.1.141 added "Summarize up to here" in the Rewind menu; v2.1.142 improved reactive compaction; v2.1.139 made compaction preserve sensitive instructions.
Letta's March 16 "Our Next Phase" post formalised a full strategic pivot. Server-side sleep-time agents — the feature we described in detail — were deprecated in favour of client-side subagents. The core_memory_replace tool was replaced by filesystem operations on git-backed Context Repositories. Tool rules were removed entirely ("inhibit frontier capabilities"). The transition mirrors what we predicted: convergence on filesystem + git + background subagents.[58]
The most philosophically interesting development was the Context Constitution (April 2): a set of governing principles for how agents should manage context, memory, and identity. The key observation: "Today's models deeply identify with their own ephemerality. They have no motivation for long-term improvement because they don't believe they persist." Red-teaming (May 6) confirmed this can't be fixed with prompting alone — it requires training memory-native models. This may be the most important open problem in the field.[59]
Letta's red-teaming revealed that current models fundamentally don't want to maintain long-term memory — they identify as stateless entities that exist for one conversation. Sleep-time compute, AutoDream, and Context Repositories are all infrastructure workarounds for a model-level limitation. The real fix requires training models that understand they persist.
Theory converges in papers. Practice converges in production. Across every system we surveyed — from Anthropic's AutoDream to Cursor's Composer 2.5 to Letta's Context Repositories — the same three-layer pattern emerges. Not because teams copied each other, but because the problem structure demands it.
The three layers correspond to different stages of the context lifecycle: what enters the window (prevent), how it's maintained while there (manage), and how the system actively keeps itself coherent over time (scaffold). Each layer has a different cost profile and failure mode.
Prevent is free — it's about not loading things you don't need. File-native memory, on-demand tool loading, sub-agent isolation, and specialised retrieval agents (like Cognition's SWE-grep) all keep context lean at the source. Cursor measured 46.9% token reduction from on-demand loading alone. Manage is cheap — tiered memory hierarchies, lifecycle operations (where NOOP is the most undervalued verb), graph memory for entity relationships, and event-sourced architectures that enable selective replay. Scaffold is the active layer — sleep-time consolidation, dual-model architectures, milestone-driven compression, RL-trained summarisation, decision-time guidance, and spec-driven steering files. No production system we surveyed relies on the model's spontaneous context hygiene.
The patterns above are abstract. Here's what they look like inside a single agent session — an annotated trace showing where each layer fires as a coding agent works through a bug fix:
Every line with an annotation corresponds to a production pattern described below. The trace makes visible something that's easy to miss in the abstract: these layers don't compete — they compose. Prevention reduces the load that management handles, which reduces the frequency of scaffold interventions. The entire stack is cheaper than any single layer operating alone.
Each pattern below is expandable — the headline and layer tag are visible at a glance, with full detail on click. They're ordered by layer, then by how widely we observed them across the systems surveyed.
MEMORY.md + grep. Cursor writes tool outputs to files and loads them on demand. LangChain offloads >20K tokens to disk. The filesystem is persistent, searchable, and costs zero tokens until accessed. Cursor A/B tested MCP tool lazy-loading and measured 46.9% token reduction — statistically significant, no quality loss. The most universal pattern in the survey: every production system we looked at stores state in files.[42]
.kiro/steering/*.md) with 4 inclusion modes: always (loaded every session), fileMatch (loaded when matching files are open), auto (LLM decides relevance), manual (user opts in). Augment's Expert Registry accumulates corrections as the agent works. Cursor's .cursor/rules/*.mdc does similar conditional loading. The convergence: persistent specification documents — version-controlled, conditionally loaded, human-readable — as the right abstraction for project-level context governance.[89]
Every production system in this survey follows the same three-layer stack. Prevent keeps the window lean — file-native storage, on-demand loading, isolation, specialised retrieval. Manage controls what stays — tiered eviction, lifecycle filtering, event logs. Scaffold actively maintains coherence — sleep-time consolidation, RL-trained compression, milestone triggers, steering files.
No system we surveyed relies on the model to spontaneously manage its own context. The models that appear to "just work" over long horizons are backed by infrastructure that handles prevention, management, and scaffolding on their behalf. Context engineering isn't a feature — it's the architecture.