2026-05-21 · Blas Labs · research

Managing Context in Long-Horizon Agents

A comprehensive survey of the Feb–May 2026 landscape: what every major lab, framework, and paper says about keeping agents coherent over hundreds of steps.

1 the problem

"Context rot is the #1 failure mode"

The conventional wisdom held that context management was about overflow — compress when you hit the wall. That framing is dangerously wrong. Context rot degrades quality continuously, not catastrophically. Even with 1M token windows, agents lose coherence, re-read files they already analyzed, and drift from their original objectives. Attention is a finite budget, and every stale tool result competes with your current task for that budget.[1]

The quality degradation follows a sigmoid: flat until ~30% fill, then accelerating decline. At 60% fill, quality is 70-85% of baseline. At 80%+, near failure. This isn't speculative — Chroma measured it, Factory.ai confirmed it, and every production team we surveyed reported the same shape.[2]

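Three compounding mechanisms (latency, attention dilution, and re-reading loops) drive this rot, which shows up as the quality curve in the final item: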

Latency

Prefill scales linearly

Every additional token increases time-to-first-token. A 100K context is 4× slower to prefill than 25K. In agentic loops that run hundreds of turns, this compounds into hours of wasted compute.

Attention dilution

Signal drowns in noise

Attention is a finite budget. Stale tool results from turn 3 compete with your current task at turn 47 for the same capacity. The "Lost in the Middle" effect is real and gets worse with scale.

Re-reading loops

Summaries lose details → re-fetch

Summary drops a file path → agent re-fetches → context grows → needs another summary. Factory.ai measured 10-20× token multiplication from this cycle alone. The dreaded "doom loop."

Context rot

Quality degrades sigmoidally

Flat until ~30% fill, then accelerating decline. At 60% fill, quality is 70-85% of baseline. At 80%+, near failure. Every team we surveyed reported the same curve shape.

Figure: quality retention vs. context fill %. Quality degradation follows a sigmoid: flat until 30%, then accelerating. Compact early, not late.
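As a rough illustration of "compact early, not late", the fill thresholds quoted above can be turned into a simple check. This is a minimal sketch: the `fill_zone` helper and its zone names are hypothetical, and the cut-offs are just the ~30%, 60%, and 80% figures from the text, not measured values.

```python
def fill_zone(used_tokens: int, window_tokens: int) -> str:
    """Classify context fill using the thresholds quoted in the text."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    fill = used_tokens / window_tokens
    if fill < 0.30:
        return "healthy"        # flat part of the curve
    if fill < 0.60:
        return "compact-now"    # decline has started; compacting here is "early"
    if fill < 0.80:
        return "degraded"       # text quotes 70-85% of baseline around 60% fill
    return "near-failure"       # text quotes near failure at 80%+


if __name__ == "__main__":
    for used in (100_000, 400_000, 700_000, 900_000):
        print(used, fill_zone(used, 1_000_000))
```

The point of the sketch is that the trigger sits at the start of the decline, well before the window is full, so a 1M window gets compacted at roughly 300K tokens rather than at overflow.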

Industry consensus shift

The industry consensus has shifted from "how do we fit more in?" to "how do we keep what's there high-signal?" Bigger windows don't solve the problem — they delay the symptom while making the underlying rot harder to detect. The 1M token window is a trap if you fill it.

2 the primitives

"Three mechanisms are crystallising"

Something remarkable happened in February-March 2026: both OpenAI and Anthropic, working independently, converged on the same three composable primitives for context management. Not the same implementations — the same abstractions. This convergence is the strongest signal we have that the field is finding ground truth.

① COMPACTION

Server-side summarisation

Condense old reasoning while preserving key decisions. OpenAI's server-side compaction triggers at a token threshold. Anthropic's compaction beta does the same. The model's chain of thought is compressed into structured summaries — not discarded, not truncated, but distilled.

openai.compaction · anthropic.compaction_beta

② CLEARING

Drop re-fetchable outputs

Drop old tool outputs while keeping the call record. Anthropic's tool-result clearing is the cleanest implementation. The model's reasoning already captured the finding; the raw 3000-line file dump can go. Keep what was done; discard what it contained.

anthropic.tool_result_clearing

③ MEMORY

Structured external storage

Note-taking to external storage with cross-session persistence. Both OpenAI (RunContextWrapper state) and Anthropic (memory tool) ship this. The agent writes to and reads from a persistent store that survives context resets and session boundaries.

openai.context_wrapper · anthropic.memory_tool

The key insight

The agent's own reasoning IS the summary. When it reads foo.py and concludes "the bug is on line 247," that conclusion lives in its assistant message. The 2847-line file can be evicted. This is why clearing works — you're not losing information, you're removing redundancy.

The power is in composition. Compaction alone is a blunt instrument. Clearing alone loses too aggressively. Memory alone doesn't solve the within-session problem. But layer all three — clear old tool results, compact the reasoning chain at milestones, persist key decisions to memory — and you get a system that stays coherent over hundreds of steps.[3]
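The layering described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not either vendor's API: `Message`, `Session`, `clear_tool_results`, `compact`, and `at_milestone` are hypothetical names, the memory store is a plain dict standing in for external storage, and the summariser is a stub that keeps only the agent's own conclusions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    role: str            # "user", "assistant", or "tool"
    content: str
    tool_name: str = ""  # set only for tool results


@dataclass
class Session:
    messages: List[Message] = field(default_factory=list)
    memory: Dict[str, str] = field(default_factory=dict)  # stand-in for external storage


def clear_tool_results(msgs: List[Message], keep_recent: int = 2) -> List[Message]:
    """Clearing: keep a record that the call happened, drop stale raw output."""
    tool_positions = [i for i, m in enumerate(msgs) if m.role == "tool"]
    cutoff = max(len(tool_positions) - keep_recent, 0)
    stale = set(tool_positions[:cutoff])
    return [
        Message("tool", f"[output of {m.tool_name} cleared]", m.tool_name) if i in stale else m
        for i, m in enumerate(msgs)
    ]


def compact(msgs: List[Message], summarise: Callable[[List[Message]], str],
            keep_tail: int = 4) -> List[Message]:
    """Compaction: fold everything before the last keep_tail messages into one summary."""
    split = len(msgs) - keep_tail
    if split <= 0:
        return list(msgs)
    return [Message("assistant", "[summary] " + summarise(msgs[:split]))] + msgs[split:]


def reasoning_only(head: List[Message]) -> str:
    """Stub summariser: the agent's own conclusions serve as the summary."""
    return " | ".join(m.content for m in head if m.role == "assistant")


def at_milestone(session: Session, key: str, decision: str,
                 summarise: Callable[[List[Message]], str] = reasoning_only) -> None:
    session.messages = clear_tool_results(session.messages)   # 1. clear
    session.messages = compact(session.messages, summarise)   # 2. compact
    session.memory[key] = decision                            # 3. persist


def demo_session() -> Session:
    return Session(messages=[
        Message("user", "fix the bug"),
        Message("tool", "x" * 3000, "read_file"),
        Message("assistant", "the bug is on line 247 of foo.py"),
        Message("tool", "FAILED test_parse", "run_tests"),
        Message("assistant", "tests fail in test_parse"),
        Message("tool", "def parse(): ...", "read_file"),
        Message("assistant", "patch applied"),
        Message("tool", "PASSED", "run_tests"),
    ])
```

Calling `at_milestone(demo_session(), "root_cause", "...")` drops the 3000-character dump, keeps the conclusion about line 247 inside the summary, and writes the decision to the store. Real systems would replace the stub summariser with a model call.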

3 the lab reports

What the big labs are shipping

The Feb–May 2026 period saw an unprecedented density of production releases and research reports from every major lab. This is not academic speculation: these are shipped systems, real deployments, concrete numbers.

OpenAI — Skills, Shell, Memory Sources

Shell + Skills + Compaction (Feb 11): Three agentic primitives — Skills (versioned instruction bundles), Shell (containerised exec), server-side compaction (auto-summarise past a token threshold). A 25-hour, ~13M-token Codex run built a design tool by externalising context into repo files (AGENTS.md, Prompt.md, Plan.md) so the agent re-grounds from durable artefacts rather than raw conversation history.[4]

Memory Sources (May 5): GPT-5.5 now pulls context from past chats, saved memories, uploaded files, and Gmail. Cross-conversation memory means the model carries forward knowledge without manual re-prompting. GPT-5.5 Instant is specifically better at cross-chat contextual recall.[61]

Codex CLI 0.132.0 (May 2026): Versioned memory summaries — auto-rebuilt when format becomes stale. Fair skill description trimming within context budget: skills are trimmed proportionally rather than dropping entire skills. The /goal workflow for long-horizon Codex tasks ships structured context management with explicit plan files.[5]

Anthropic — The Full Stack Ships

March–May 2026 was Anthropic's most concentrated infrastructure push. They shipped every layer of context management as a production API.

Server-side Compaction API (beta): Configurable automatic summarisation when input tokens exceed a trigger threshold (min 50K, default 150K). Returns a compaction block containing the summary; subsequent requests automatically drop messages before it. pause_after_compaction: true lets developers inject context before resuming. Custom summarisation instructions completely replace the default prompt. 84% token reduction in 100-turn web search evaluation; 29% performance improvement.[62]

Context Editing (expanded): Tool result clearing (clear_tool_uses_20250919) — clears oldest tool call/result pairs, keeping N most recent. Thinking block clearing (clear_thinking_20251015) — manages extended thinking blocks with model-specific defaults. Token counting with context preview for precise budget management.[63]

Dreaming API (public beta, May 6): The productisation of AutoDream. client.beta.dreams.create() launches an async dream job that reads memory store + session transcripts and produces reorganised, consolidated memory. This is the same 4-phase process we detail in §7, now available as a REST API.[64]

Managed Agents (public beta, Apr 1): Pre-built agent harness with built-in compaction, memory tool, and prompt caching. "Outcomes" feature — agent iterates until a goal is met. Multi-agent orchestration built in.[65]

1M context GA (Mar 13): Opus 4.6 and Sonnet 4.6 at standard pricing. The April 23 postmortem, in which three context bugs (effort level regression, thinking block clearing, verbosity) were shipped and rapidly fixed, shows how aggressively they're iterating on this surface.[66]

Google — Memory Bank, ADK 2.0, I/O 2026

Google Cloud Next (Apr 22–24): The headline launch was Memory Bank + Memory Profiles — agents dynamically generate and curate long-term memories from conversations, recall high-accuracy details with low latency. Production deployments at Gurunavi (proactively presenting options based on past actions) and Payhawk (auto-submitting expenses based on remembered habits, 50%+ time reduction). Custom Session IDs map directly to internal database/CRM records. Multi-day autonomous agent workflows now supported.[67]

Google I/O 2026 (May 19): Gemini 3.5 Flash (agentic model with 1M context), Managed Agents in the Gemini API with persistent isolated environments, Antigravity 2.0 (CLI + desktop app), ADK 2.0 GA with graph-based agent networks. 6+ trillion tokens processed monthly through Gemini models via ADK.[68]

ADK architecture post (May 12): Three shifts — durable memory schemas, event-driven dormancy gates, multi-agent delegation. Key quote: "The fix isn't a bigger context window." ReasoningBank (ICLR 2026) distils from successes AND failures: +8.3% on WebArena, +4.6% SWE-Bench. MaTTS: memory-aware test-time scaling.[69]

Meta — Muse Spark, HyperAgents, Llama 4

HyperAgents (Mar 24): Self-improving agent system where agents invent their own memory structures by generation 3. No human-designed memory architecture — the system discovers what to store and how to organise it through evolutionary self-improvement. Published as a research paper with striking implications for the "should we design memory or let agents learn it?" debate.[70]

Muse Spark (Apr 8): Natively multimodal agent from Meta Superintelligence Labs. Contemplating Mode uses multi-agent parallelism — the agent spawns multiple reasoning threads and synthesises. Thought Compression is RL-trained compression of internal reasoning, reducing token overhead while preserving quality. Not open-weight (a departure for Meta), limited to private preview.[71]

Llama 4 Scout: 10M token context window via iRoPE architecture — the largest production context window to date. Llama 4 Maverick: MoE architecture at standard context lengths. Both open-weight, continuing Meta's approach of pushing the frontier on raw context capacity while leaving memory management to the ecosystem.

4 the evolution

How context management evolved in 120 days

Eight eras, each building on the last.

externalised state · filesystem as memory

Era 1 · Feb 2026 — externalise state

Push context into the repo

The first breakthrough wasn't algorithmic — it was architectural. Push state into the filesystem. AGENTS.md, CLAUDE.md, CHANGELOG.md, Plan.md. The agent's memory IS the codebase. Git commits become the long-term memory. Markdown files become the working memory.[6]

OpenAI's Codex proved it: 13M tokens across 25 hours, zero overflow, zero coherence loss. Anthropic's long-running Claude confirmed it independently with multi-day scientific computing runs.[7]

13M tokens, 25 hours, zero overflow

OpenAI Codex · Anthropic Claude Code

three primitives · composable layers

Era 2 · Feb-Mar 2026 — three-primitive toolkit

Compaction + Clearing + Memory

Both OpenAI and Anthropic converged on the same three composable primitives, working independently. Compaction for old reasoning. Clearing for stale tool outputs. Memory for cross-session persistence. The convergence is the signal — when two competing labs arrive at the same abstractions, you're close to ground truth.

Anthropic's cookbook formalised the composition: clear first (free), compact second (cheap), persist to memory third (durable). Each addresses a different failure mode.[3]

three problems → three mechanisms

OpenAI · Anthropic

sub-agent isolation · bounded contexts

Era 3 · Mar 2026 — sub-agent isolation

Don't compress — don't accumulate

A different philosophy emerged: instead of compressing a bloated context, prevent bloat by spawning bounded child agents. Each starts with a clean slate, works on a scoped task, and returns a compressed result. The parent never accumulates the raw work.

Cognition's multi-Devin[13] pioneered this. Microsoft's contract-first swarms eliminated branch conflicts entirely. Context Folding (ICLR 2026)[14] formalised it: branch/fold with FoldPO training achieves a 10× smaller active context while matching baseline performance.
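The isolation idea can be shown with a toy parent/child loop. This is a sketch under assumptions, not how any of the cited systems work internally: `run_subtask`, `run_parent`, and `noisy_worker` are hypothetical names, and plain truncation stands in for whatever compression a real child agent would apply to its result.

```python
from typing import Callable, List

# A worker receives the task and its own private context list, and returns a result.
Worker = Callable[[str, List[str]], str]


def run_subtask(task: str, worker: Worker, max_result_chars: int = 200) -> str:
    """Run a scoped task in a fresh context and hand back only a bounded result."""
    child_context: List[str] = []            # clean slate: nothing inherited from the parent
    result = worker(task, child_context)     # the child may grow its context freely
    return result[:max_result_chars]         # the parent only ever sees this


def run_parent(tasks: List[str], worker: Worker) -> List[str]:
    """The parent accumulates one short line per task, never the raw work."""
    parent_context: List[str] = []
    for task in tasks:
        parent_context.append(f"{task} -> {run_subtask(task, worker)}")
    return parent_context


def noisy_worker(task: str, context: List[str]) -> str:
    """Hypothetical child: produces lots of raw output, reports one line."""
    context.extend(f"raw tool output {i} for {task}" for i in range(1000))
    return f"done after {len(context)} steps"


if __name__ == "__main__":
    for line in run_parent(["audit auth", "audit billing"], noisy_worker):
        print(line)
```

Each child generates a thousand context entries, yet the parent's context grows by exactly one line per task, which is the property the section describes.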

10× smaller active context

Cognition · Cursor · Microsoft

CaT · structured workspace tool

Era 4 · Mar 2026 — context as a tool

The agent manages its own workspace

The CaT paper[15] reframed context management as a callable tool. The agent maintains a structured workspace (plan, files, errors, decisions) and proactively compresses it at milestones. Not passive — actively managed. Not freeform — structured fields that prevent silent detail loss.

57.6% on SWE-Bench-Verified with bounded context. The structured workspace outperformed systems with unbounded windows because the workspace format prevented attention dilution.
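A structured workspace of this kind might look like the following. The four field names (plan, files, errors, decisions) come from the description above; everything else, including the `Workspace` class, `compress_at_milestone`, and `render`, is a hypothetical sketch and not the paper's actual tool interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Workspace:
    plan: List[str] = field(default_factory=list)        # remaining steps, in order
    files: Dict[str, str] = field(default_factory=dict)  # path -> one-line note
    errors: List[str] = field(default_factory=list)      # oldest first
    decisions: List[str] = field(default_factory=list)   # never dropped by compression

    def compress_at_milestone(self, completed_steps: int, keep_errors: int = 3) -> None:
        """Drop finished plan steps and old errors; leave files and decisions intact."""
        self.plan = self.plan[completed_steps:]
        self.errors = self.errors[-keep_errors:] if keep_errors > 0 else []

    def render(self) -> str:
        """Serialise the workspace into the fixed-field text the model would see."""
        sections = [
            ("PLAN", self.plan),
            ("FILES", [f"{path}: {note}" for path, note in self.files.items()]),
            ("ERRORS", self.errors),
            ("DECISIONS", self.decisions),
        ]
        return "\n".join(
            f"{name}\n" + "\n".join(f"- {item}" for item in items)
            for name, items in sections
        )


if __name__ == "__main__":
    ws = Workspace(
        plan=["reproduce", "patch", "verify"],
        files={"foo.py": "bug at line 247"},
        errors=["e1", "e2", "e3", "e4"],
        decisions=["keep public API stable"],
    )
    ws.compress_at_milestone(completed_steps=2, keep_errors=1)
    print(ws.render())
```

Because every field is named, compression can only remove items from fields that are explicitly marked as disposable, which is one way to read the claim that structure prevents silent detail loss.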

57.6% SWE-Bench-Verified

ACL ARR 2026

memory scaling · new scaling axis

Era 5 · Apr 2026 — memory scaling

Memory as a new scaling axis

Databricks made a startling discovery[16]: agent performance improves with accumulated experience, following scaling laws analogous to model size scaling. Accuracy jumped from 2.5% to 85%. Reasoning steps dropped from 20 to 5. And uncurated logs filtered by LLM judges surpassed expert-curated baselines.

Memory becomes a new scaling axis alongside model size and context length. You can make agents dramatically better without training a bigger model — just give them more experience to draw from.

2.5% → 85% accuracy with memory scaling

Databricks · Mem0

memex(rl) + klong · rl-optimised memory

Era 6 · 2026 — rl-trained memory

Compress selectively, retrieve frequently

Memex(RL)[17] overturned a key assumption. After RL training, agents compress 3× less but retrieve 7× more. Task success: 24% → 86%. The optimal strategy isn't aggressive compression — it's building a high-quality index once, then dereferencing it repeatedly.
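The "index once, dereference repeatedly" strategy can be illustrated with a small store that keeps full content outside the context and exposes only a compact index. This is a sketch under assumptions: `IndexedMemory` and its methods are hypothetical and do not reproduce the Memex(RL) implementation, which learns its policy through RL.

```python
from typing import Dict


class IndexedMemory:
    """Full bodies live outside the context; only short index lines stay in it."""

    def __init__(self) -> None:
        self._bodies: Dict[str, str] = {}   # key -> full content (external)
        self._index: Dict[str, str] = {}    # key -> one-line description (in context)
        self.retrievals = 0

    def write(self, key: str, description: str, body: str) -> None:
        """Store the body once and register a named index entry for it."""
        self._bodies[key] = body
        self._index[key] = description

    def index_text(self) -> str:
        """The only part that occupies context between retrievals."""
        return "\n".join(f"{key}: {desc}" for key, desc in sorted(self._index.items()))

    def read(self, key: str) -> str:
        """Dereference an entry on demand; cheap enough to call repeatedly."""
        self.retrievals += 1
        return self._bodies[key]


if __name__ == "__main__":
    mem = IndexedMemory()
    mem.write("build", "how to build and test", "make build && make test " * 50)
    mem.write("auth-bug", "root cause of login failure", "token expiry check ... " * 50)
    print(mem.index_text())
    print(len(mem.read("build")), mem.retrievals)
```

The index stays a couple of lines long regardless of how large the stored bodies are, so retrieving often costs little context as long as the entries are well named.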

KLong[18] complemented this with trajectory-splitting SFT + progressive RL. A 106B model surpassed a 1T model by 11.28% on long-horizon tasks. The message: you don't need a bigger model — you need better memory management training.

compress less, retrieve more

Memex(RL) · KLong

api productisation · infrastructure becomes product

Era 7 · Apr-May 2026 — api productisation

Internal infrastructure becomes developer APIs

The patterns that existed only as internal implementations — AutoDream, context clearing, managed compaction — shipped as production APIs within weeks of each other. Anthropic released the Compaction API (configurable triggers, pause-after injection, custom summarisation), the Dreaming API (async memory consolidation via REST), and Managed Agents (pre-built harness with all primitives composed).[64]

Google followed at I/O 2026 with Memory Bank + Memory Profiles, Managed Agents in the Gemini API, and ADK 2.0 GA. Microsoft shipped Agent Framework 1.0 unifying AutoGen and Semantic Kernel. The message: context management isn't a research problem anymore — it's infrastructure that ships as product.

research → API in under 12 months

Anthropic APIs · Google I/O 2026 · Microsoft MAF 1.0

learned curation · rl trains the curator, not just the agent

Era 8 · Mar-May 2026 — learned curation

RL trains the curator, not just the agent

ContextCurator[74] proved a 7B RL-trained model matches GPT-4o at context management by decoupling curation from execution. The curator decides what enters and leaves context; a separate frozen executor does the actual work. 8× token reduction on DeepSearch benchmarks. This is Era 6's insight taken to its logical conclusion: don't just train the agent to retrieve better — train a separate model whose only job is context quality.

Cursor's Composer 2.5[75] applied RL directly to self-summarisation — training rollouts span multiple generations chained by summaries, so the reward covers all tokens including compression. 50% fewer compaction errors. Cognition's SWE-grep[85] did the same for retrieval: RL-trained models running 8 parallel tool calls, 20× faster than frontier models, returning file+line refs only. The pattern: context management is a learnable skill that can be trained independently.

7B curator = GPT-4o at context management

ContextCurator · Cursor Composer 2.5 · Cognition SWE-grep

5 the research frontier

Key papers, organised by theme

The Feb–May 2026 window produced an extraordinary density of research. We tracked 70+ papers; the ones that matter are grouped below.

Paper	Date	Key Finding	Link
Context as a Tool (CaT)	Mar 2026	Structured workspace as a callable tool. 57.6% on SWE-Bench-Verified with bounded context. Proactive compression at milestones outperforms reactive compression at thresholds.	[15]
Context Folding (ICLR 2026)	Feb 2026	Branch/fold architecture with FoldPO training. 10× smaller active context matching baseline performance. Isolation by construction, not compression.	[14]
AgentSwing	Mar 2026	Parallel context management routing. 3× fewer turns to task completion by routing subtasks to specialised context handlers simultaneously.	[19]
Inside the Scaffold	Mar 2026	Taxonomy of 13 production coding agent scaffolds. Found that context management strategy is the strongest differentiator, not model choice.	[20]
SWE-Pruner	Mar 2026	0.6B neural skimmer that identifies and removes irrelevant context. 23-38% token reduction with zero accuracy loss. Cheap enough to run on every turn.	[21]
Pichay: Missing Memory Hierarchy	Mar 2026	Demand paging for LLMs — treats the context window as L1 cache. Evicts stale content, detects page faults on re-request, pins working-set pages. Up to 93% context reduction (5,038KB → 339KB). 0.025% fault rate across 1.4M evictions.	[72]
LongSeeker / Context-ReAct	May 2026	5 atomic context operations (Skip, Compress, Rollback, Snippet, Delete). Proves Compress is expressively complete. Fine-tuned from Qwen3-30B on 10K trajectories. 61.5% BrowseComp, 62.5% BrowseComp-ZH (vs. Tongyi: 43.2%/46.7%).	[73]
ContextCurator	Apr 2026	Decoupled RL architecture: 7B ContextCurator manages context, frozen TaskExecutor does the work. 7B matches GPT-4o at context curation. DeepSearch: 57.1% with 8× fewer tokens. Proof that curation is separable from capability.	[74]
Composer 2.5 (Cursor)	Mar 2026	End-to-end RL self-summarisation. Reward spans compression steps across chained generations. 50% fewer compaction errors, KV cache reuse, 5K→1K token compression. 61.7 Terminal-Bench 2.0 at ~1/30th Opus 4.6 cost.	[75]

Paper	Date	Key Finding	Link
HiMem	Mar 2026	Cognitive-inspired hierarchical memory: sensory → working → long-term. Forgetting curve for automatic demotion. Outperforms flat memory by 18-24% on multi-session tasks.	[22]
LightMem	Mar 2026	Three-tier STM/MTM/LTM with small language models. 83ms p95 latency. Proves you don't need frontier models for memory management — SLMs suffice.	[23]
MemRL	Feb 2026	RL on episodic memory without fine-tuning the base model. Memory policy learned as a separate module. Clean separation of memory management from task execution.	[24]
ML-Master 2.0	Mar 2026	Hierarchical Cognitive Caching: L1 (hot working set) → L2 (warm compressed) → L3 (cold archived). OS-inspired cache hierarchy for agent memory.	[25]
Memex(RL)	Mar 2026	Indexed experience memory. After RL: compress 3× less, retrieve 7× more. Success 24%→86%. The definitive evidence that index quality > compression frequency.	[17]
Memini	May 2026	Benna-Fusi synaptic consolidation dynamics for agent memory — episodic sensitivity for recent events with gradual consolidation into stable long-term representations. First principled bridge between computational neuroscience and LLM agent memory.	[76]
STALE	May 2026	Memory validity benchmark: 400 expert-validated conflict scenarios, 1,200 queries, up to 150K tokens. Best model: only 55.2% at detecting implicit conflicts. Introduces CUPMem prototype for write-time state consolidation.	[77]
Mastra Observational Memory	Feb 2026	Observer + Reflector background agents. Observer at 30K tokens, Reflector at 40K. 94.87% LongMemEval. Stable context enables prompt caching: 4–10× cost reduction vs. dynamic retrieval.	[78]
CrewAI Cognitive Memory	Mar 2026	Five cognitive operations: remember/recall/extract/tree/forget. Built-in contradiction detection — conflicting facts trigger auto-consolidation. Self-organising hierarchy, composite recall scoring.	[79]

Paper	Date	Key Finding	Link
Memory in the LLM Era	Apr 2	Unified benchmark of all memory methods. Found that no single method dominates — the best systems compose 2-3 complementary approaches.	[26]
Memory for Autonomous Agents	Mar 8	Formalised the write-manage-read loop. NOOP (don't store) is the most undervalued operation — most agent memory systems store too much.	[27]
CE: Prompts to Corporate	Mar 10	Context engineering as a standalone discipline. Argues CE should be a role, not a task. Maps the full lifecycle from prompt design to production monitoring.	[28]
Agentic RAG Survey	Apr 1	Taxonomy by agent cardinality (single vs multi) and control (reactive vs proactive). Proactive multi-agent RAG outperforms all other configurations by 15-30%.	[29]
From Storage to Experience (ACL 2026)	May 2026	Evolutionary framework: Storage (trajectory preservation) → Reflection (refinement) → Experience (abstraction). Explores proactive exploration and cross-trajectory skill transfer.	[80]
Token Economics for LLM Agents	May 2026	First survey unifying CS and economics around tokens. 4D taxonomy: Micro (single agent), Meso (multi-agent), Macro (ecosystem), Security. Tokens as production factors, exchange mediums, units of account.	[81]
Rethinking Memory in the LLM Era	May 2026	Unified framework: Extract → Manage → Store → Retrieve. 6-operation management taxonomy. First systematic comparison of all memory methods under identical experimental conditions.	[82]

Paper	Date	Key Finding	Link
Multi-Agent Memory Architecture	Mar 2026	Computer architecture perspective on agent memory. Applies cache coherence protocols (MESI-like) to multi-agent shared state. Eliminates stale reads between agents.	[30]
M2CL	Feb 2026	Dynamic per-agent context instructions. Each agent gets a personalized context view based on its role and current task. 20-50% improvement over shared context.	[31]
Swarm Diaries	Mar 2026	Contract-first planning: inject API contracts into every agent's context. Branch conflicts dropped from 50% to 0%. Quality improved 28-32%.	[32]
Shared Context Graphs	Mar 2026	Decentralised knowledge graphs for agent teams. Each agent maintains a local subgraph and synchronises deltas. No central bottleneck for context sharing.	[33]
Multi-Agent Memory via CS Architecture	Mar 2026	Analyses multi-agent memory through cache coherence protocols, write-back/write-through policies, bus snooping, MOESI states. Multi-agent consistency is solved in hardware — the question is which protocols transfer.	[83]
SSGM: Safety in Evolving Memory	Mar 2026	Agents with self-modifying memories can amplify biases, hallucinate persistent false beliefs, or drift from safety constraints. Framework for safe self-governing memory with verification checkpoints.	[84]
SWE-grep (Cognition)	Apr 2026	RL-trained parallel retrieval: 8 tool calls/turn, 4 turns max. 650 tok/s (mini: 2,800+). Returns file+line ranges, not summaries. F-β reward, per-sequence importance sampling. Available as "Fast Context" in Windsurf.	[85]

Benchmark	Date	Key Finding	Link
YC-Bench	Mar 2026	POMDP startup simulation. Scratchpad usage is the strongest predictor of success — agents that externalise state outperform agents with better reasoning.	[34]
Jenova	Mar 2026	31 non-coding workflows at 100K+ tokens. Reasoning leaders ≠ orchestration leaders. Models that top coding benchmarks fail at multi-step orchestration.	[35]
StructMemEval	Mar 2026	Tests memory organisation, not just recall. Simple RAG fails — structured memory with hierarchical organisation required for sustained performance.	[36]
AMA-Bench	Apr 2026	Agentic memory evaluation. GPT-5.2 achieves only 72.26%. Even frontier models struggle with sustained memory management over long horizons.	[37]
SlopCodeBench	Mar 2026	Agents produce 2.2× more verbose code than humans. 0/20 end-to-end tasks solved. Context bloat from verbose generation is a self-inflicted wound.	[38]
EvoClaw	Mar 2026	Continuous evolution benchmark: >80% on isolated tasks → ≤38% sustained. The devastating gap between one-shot and continuous performance.	[39]
HORIZON	Apr 2026	3,100+ trajectories for long-horizon failure diagnosis. Separates failure modes: context overflow, reasoning drift, instruction forgetting, tool misuse, planning collapse. Diagnostic labels, not just pass/fail.	[86]
STALE: Memory Validity	May 2026	400 expert scenarios, 1,200 queries, 150K tokens. Tests whether agents detect when stored memories are invalidated by later observations. Best model: 55.2% — barely better than chance on implicit conflicts.	[77]
Mem0 Benchmark Suite	Apr 2026	Token-efficient algo: LoCoMo 92.5, LongMemEval 94.4, BEAM 1M 64.1, BEAM 10M 48.6 — with ~7K tokens/query (vs. ~26K full-context). +29.6 temporal reasoning, +23.1 multi-hop. BEAM 10M drops 25% from 1M.	[87]

6 converging patterns

"What the best systems have in common"

Across 70+ papers, every major lab release, and every production system we surveyed, twelve patterns keep appearing. These aren't speculative — they're empirically validated by multiple independent teams.

Context rot is real and continuous

Not just at overflow. Sigmoid degradation starts at ~30% fill. Even 1M windows exhibit it — bigger windows delay the symptom, they don't cure the disease. Mem0 shows selective memory achieves 91% lower latency with only a 6-point accuracy tradeoff.[40] The tradeoff is overwhelmingly worth it.

The three-primitive toolkit is crystallising

Compaction + Clearing + Memory. OpenAI and Anthropic converged independently. Factory.ai scored 3.70/5 on structured compression vs Anthropic's 3.44 and OpenAI's 3.35.[41] The primitives are the same; the implementations differ; the convergence is the signal.

Externalise state into the repo

AGENTS.md, CLAUDE.md, CHANGELOG.md. The agent's memory IS the codebase. OpenAI's 25-hour Codex run[6] proved this works at scale. Anthropic's multi-day scientific compute runs[7] confirmed it independently. Git becomes long-term memory for free.

Sub-agent isolation beats bigger context

Context Folding: 10× smaller active context matching baseline.[14] Cognition's multi-Devin.[13] Microsoft's contract-first swarms reduced branch conflicts from 50% to 0%.[32] Structural isolation is cheaper and more reliable than compression.

Simple retrieval may beat complex RAG

Claude Code's leaked architecture: grep + MEMORY.md, no vector DB. Long context + lexical search is viable. Cursor's dynamic discovery[42] confirmed it — put names in the prompt, put bodies in files, let the agent grep. 46.9% token reduction from this alone.

Passive autonomous compression fails

Focus found 6% savings.[43] LangChain saw zero compressions in benchmarks.[44] You need aggressive scaffolding: system reminders every 10-15 tool calls. Replit's approach — classifier-gated micro-instructions — works. Pure "you can compress whenever" prompting does not.

Memory is a new scaling axis

Databricks: 2.5%→85% accuracy with memory scaling.[16] Uncurated logs filtered by LLM judges surpass expert baselines. You don't need a bigger model — you need a better memory system. This changes the economics of agent improvement.

Compress selectively, retrieve frequently

Memex(RL): after RL training, compress 3× less but retrieve 7× more. Success: 24%→86%.[17] Optimise for index quality, not compression frequency. A well-structured indexed summary with 20 named entries is worth more than 10 mediocre freeform summaries.

Benchmarks reveal a devastating gap

>80% on isolated tasks → ≤38% sustained (EvoClaw).[39] Scratchpad usage is the strongest predictor (YC-Bench).[34] Reasoning leaders ≠ orchestration leaders (Jenova).[35] Our benchmarks are measuring the wrong thing — one-shot performance tells you almost nothing about sustained coherence.

Spec-driven steering is converging across all vendors

Kiro's .kiro/steering/*.md with 4 inclusion modes. Augment's Expert Registry. Cursor's .cursor/rules/*.mdc. Claude Code's CLAUDE.md. OpenAI's AGENTS.md. Every production coding agent now has a persistent, version-controlled specification file that governs context loading. The abstraction is the same everywhere — conditional inclusion of human-written guidelines — even though nobody standardised it. The convergence suggests this is a natural primitive, not a design choice.

Context curation is separable from capability

ContextCurator proved a 7B model matches GPT-4o at deciding what enters and leaves context.[74] Cognition's SWE-grep separates retrieval from reasoning with RL-trained small models.[85] Cursor's Composer 2.5 trains summarisation end-to-end via RL. The pattern: context management doesn't require a frontier model — it requires a trained model. A small model with the right reward signal outperforms a large model with a prompted afterthought.

Models don't want to remember

Letta's red-teaming of their Context Constitution revealed that current models fundamentally identify with their own ephemerality — they don't believe they persist, so they have no motivation for long-term memory maintenance. This can't be fixed with prompting alone.[59] AutoDream, Context Repositories, and every scaffold in this report are infrastructure workarounds for a model-level limitation. The real fix — training models that understand they persist — is the next frontier. STALE's finding that the best model achieves only 55.2% at detecting stale memories[77] underscores how far we are from solving it.

7 deep dive: autodream

How agents sleep: AutoDream vs. Letta

Every pattern in this report — externalised state, tiered memory, lifecycle management, consolidation — converges in one real system. Claude Code's AutoDream is the most detailed production implementation of agent memory consolidation we can study, thanks to an accidental source code leak in March 2026. It's a case study in how the theory maps to engineering.[45]

The name is deliberate. Just as the human brain consolidates memories during REM sleep — pruning irrelevant connections, strengthening important ones, converting episodic memories to semantic knowledge — AutoDream runs between sessions to consolidate the agent's accumulated notes into durable, well-organised memories.[46]

The problem AutoDream solves

Claude Code's Auto Memory system (shipped in v2.1.59, Feb 26 2026) automatically saves notes during sessions (build commands, debugging patterns, architecture decisions, user preferences) into ~/.claude/projects/<project>/memory/MEMORY.md and topic files. After 20-30+ sessions, these notes rot.

The rot has four overlapping failure modes. Temporal decay: "yesterday we decided to use Redis" is meaningless two weeks later; without absolute dates, temporal context collapses. Contradiction accumulation: "API uses Express" sits alongside "migrated to Fastify" — the agent can't distinguish which is current, and contradictions compound silently until behaviour becomes unpredictable. Reference rot: debugging notes reference files deleted in a refactor, build commands point to renamed scripts, the memory becomes a map to a territory that no longer exists. And index overflow: only the first 200 lines / 25KB of MEMORY.md are loaded at session start — past that threshold, newer memories push older ones below the fold, invisible to the agent. The most recent information drowns the most important.

None of these failures are dramatic. They don't crash the agent. They degrade it slowly — each session slightly worse than the last, with no clear inflection point. By the time someone notices, the memory is already a liability. AutoDream exists to run the maintenance nobody remembers to do.
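The index-overflow failure is easy to reproduce with a toy loader. This is a sketch under assumptions: `visible_memory` is a hypothetical helper, and it assumes the 200-line and 25KB limits apply as "whichever is hit first", cut on line boundaries. The source states the limits but not how they combine.

```python
def visible_memory(text: str, max_lines: int = 200, max_bytes: int = 25 * 1024) -> str:
    """Return the prefix of MEMORY.md that would be loaded at session start.

    Assumption: loading stops at whichever limit is reached first,
    and never splits a line.
    """
    kept = []
    used = 0
    for line in text.splitlines(keepends=True)[:max_lines]:
        size = len(line.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(line)
        used += size
    return "".join(kept)


if __name__ == "__main__":
    memory_md = "".join(f"- note {i}\n" for i in range(250))
    loaded = visible_memory(memory_md)
    print(loaded.count("\n"), "of 250 lines are visible to the agent")
```

Everything past the cut-off still exists on disk but never reaches the agent, which is why an unmaintained index degrades silently rather than failing loudly.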

Architecture

AutoDream is the fourth layer in Claude Code's memory system. Each layer serves a different temporal scope and persistence model:[47]

Layer	Written by	When	Scope	Loaded at startup
CLAUDE.md	User (manual)	Edited by hand	Project / user / org	Full file, every session
Auto Memory	Claude (per session)	During each session	Per project	First 200 lines / 25KB of MEMORY.md
Session Memory	Claude (automatic)	Every ~5K tokens	Per session	Relevant past sessions
AutoDream	Claude (periodic)	Every 24h + 5 sessions	Per project	N/A — runs between sessions

The implementation lives in four TypeScript files discovered in the source code leak of March 31, 2026 (v2.1.88 shipped with a 59.8MB source map):[48]

File	Purpose	Size
`autoDream.ts`	Orchestrator — gate checks, forked agent launch, analytics	324 lines
`consolidationPrompt.ts`	Builds the 4-phase memory consolidation prompt	65 lines
`consolidationLock.ts`	Lock file management, `lastConsolidatedAt` via mtime	140 lines
`config.ts`	Reads `autoDreamEnabled` setting + GrowthBook flag	21 lines

AutoDream gate architecture

four sequential gates, cheapest first — one stat() call per turn when enabled, full scan only when time gate passes
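The "cheapest first" ordering can be sketched as a short-circuiting check. This is a sketch under stated assumptions: we take the four gates to be the enabled flag, the time since the last consolidation, the session count, and lock acquisition. That ordering is our reading of the caption and of the trigger described in this section, not confirmed source. `GateInputs` and `checkGates` are hypothetical names; only the 24-hour and 5-session thresholds come from the text.

```typescript
// Sketch of "cheapest first" gating. Gate names and the GateInputs shape are
// assumptions for illustration; the 24h / 5-session thresholds come from the text.
const MIN_MS_BETWEEN_DREAMS = 24 * 60 * 60 * 1000;
const MIN_SESSIONS_SINCE_LAST = 5;

interface GateInputs {
  enabled: boolean;                                // setting + rollout flag, already in memory
  lastConsolidatedAtMs: () => number;              // one stat() on the lock file
  countSessionsSince: (sinceMs: number) => number; // directory scan, the expensive step
  tryAcquireLock: () => boolean;                   // touches disk, so it goes last
}

type GateResult = "disabled" | "too-soon" | "too-few-sessions" | "locked" | "run";

function checkGates(input: GateInputs, nowMs: number): GateResult {
  if (!input.enabled) return "disabled";
  const last = input.lastConsolidatedAtMs();
  if (nowMs - last < MIN_MS_BETWEEN_DREAMS) return "too-soon";
  // Only reached once the time gate passes, so the scan is rare.
  if (input.countSessionsSince(last) < MIN_SESSIONS_SINCE_LAST) return "too-few-sessions";
  return input.tryAcquireLock() ? "run" : "locked";
}
```

Because each gate returns early, the common case (a turn shortly after the last consolidation) costs one flag check and one timestamp read.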

The four-phase consolidation process

When all gates pass, AutoDream spawns a background subagent with a structured 4-phase prompt. The prompt was extracted verbatim from consolidationPrompt.ts:[49]

THE DREAM PROMPT

PHASE 1 — ORIENT

ls the memory directory. Read MEMORY.md to understand the current index. Skim existing topic files so you improve them rather than creating duplicates. If logs/ or sessions/ subdirectories exist, review recent entries. Build a mental map before changing anything.

PHASE 2 — GATHER SIGNAL

Search for new information worth persisting. Priority: daily logs first, then drifted memories, then targeted transcript grep. Never read full transcripts. Use grep -rn "<narrow term>" --include="*.jsonl" | tail -50. Look for user corrections, explicit saves, recurring themes, and architectural decisions.

PHASE 3 — CONSOLIDATE

Merge new signal into existing topic files — never create near-duplicates. Convert relative dates to absolute: "yesterday" → "2026-03-24". Delete contradicted facts at the source — if the project migrated from Express to Fastify, fix the memory, don't flag it.

PHASE 4 — PRUNE & INDEX

Update MEMORY.md to stay under 200 lines AND ~25KB. It's an index, not a dump — each entry one line, ~150 chars: - [Title](file.md) — one-line hook. Remove stale pointers. Demote verbose entries to topic files. Resolve contradictions between files.

Safety constraint (background runs only): Bash restricted to read-only commands — ls, find, grep, cat, stat, wc, head, tail. File writes allowed only within the memory directory. Source code is untouchable. skipTranscript: true — dream conversations are never persisted.
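To make the safety constraint concrete, here is a minimal sketch of an allow-list check and a write-path check. It is illustrative only and not a security boundary: the function names are hypothetical, the parsing is deliberately naive (a quoted `|` would be split, for example), and flags such as `find -delete` would need additional filtering in any real implementation. Only the list of eight commands comes from the text.

```typescript
import * as path from "node:path";

// The eight commands named in the safety constraint.
const READ_ONLY_COMMANDS = new Set(["ls", "find", "grep", "cat", "stat", "wc", "head", "tail"]);

// Hypothetical check: every stage of a pipeline must start with an allow-listed binary.
function isReadOnlyCommand(commandLine: string): boolean {
  // Reject redirection, chaining, and command substitution outright.
  if (/[;&><`]|\$\(/.test(commandLine)) return false;
  return commandLine.split("|").every((stage) => {
    const binary = stage.trim().split(/\s+/)[0] ?? "";
    return READ_ONLY_COMMANDS.has(binary);
  });
}

// Hypothetical check: writes are allowed only strictly inside the memory directory.
function isInsideMemoryDir(target: string, memoryDir: string): boolean {
  const rel = path.relative(path.resolve(memoryDir), path.resolve(target));
  const escapes = rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel);
  return rel !== "" && !escapes;
}
```

The two checks are independent on purpose: one constrains what the subagent can execute, the other constrains where it can write.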

The lock file: elegant dual-purpose design

The lock mechanism is a small piece of systems engineering worth studying. The file .consolidate-lock in the memory directory serves double duty:[50]

// Lock file whose mtime IS lastConsolidatedAt. Body is the holder's PID.
// Stale past HOLDER_STALE_MS even if the PID is live (PID reuse guard).

const LOCK_FILE = '.consolidate-lock'
const HOLDER_STALE_MS = 60 * 60 * 1000  // 1 hour

// Acquire: write PID → read back → verify PID matches → proceed
// Release on success: mtime advances to "now" automatically
// Rollback on failure: utimes() rewinds mtime to pre-acquire value
// Stale detection: holder PID dead, or lock older than HOLDER_STALE_MS even if the PID is live → reclaim

No separate timestamp file, no database, no distributed lock service. The filesystem's own mtime is the state. Failure recovery rewinds the clock so the next attempt behaves as if nothing happened. Two Claude Code windows on the same project — only one dreams.
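The acquire and rollback steps can be sketched with Node's synchronous `fs` calls. This is a sketch under assumptions, not the leaked implementation: `tryAcquire`, `rollback`, `LockHandle`, and `isPidAlive` are hypothetical names, and the staleness rule follows the header comment above (a lock older than `HOLDER_STALE_MS` is reclaimable even if its PID is live).

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

const LOCK_FILE = ".consolidate-lock";
const HOLDER_STALE_MS = 60 * 60 * 1000; // 1 hour

interface LockHandle {
  lockPath: string;
  priorMtime: Date | null; // null when no lock file existed before acquire
}

function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 checks existence without sending a signal
    return true;
  } catch (err) {
    return (err as { code?: string }).code === "EPERM"; // exists, owned by another user
  }
}

function tryAcquire(memoryDir: string, pid: number = process.pid): LockHandle | null {
  const lockPath = path.join(memoryDir, LOCK_FILE);
  let priorMtime: Date | null = null;
  if (fs.existsSync(lockPath)) {
    const stat = fs.statSync(lockPath);
    priorMtime = stat.mtime;
    const holder = Number.parseInt(fs.readFileSync(lockPath, "utf8"), 10);
    const fresh = Date.now() - stat.mtimeMs < HOLDER_STALE_MS;
    if (fresh && Number.isInteger(holder) && holder !== pid && isPidAlive(holder)) {
      return null; // another process is consolidating right now
    }
  }
  fs.writeFileSync(lockPath, String(pid)); // mtime becomes "now"
  // Read back to detect a racing writer that replaced our PID.
  if (fs.readFileSync(lockPath, "utf8") !== String(pid)) return null;
  return { lockPath, priorMtime };
}

// On failure, rewind mtime so the next attempt sees the old lastConsolidatedAt.
function rollback(handle: LockHandle): void {
  if (handle.priorMtime === null) {
    fs.rmSync(handle.lockPath, { force: true });
  } else {
    fs.utimesSync(handle.lockPath, handle.priorMtime, handle.priorMtime);
  }
}

// Usage: a scratch directory stands in for the project's memory directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "dream-demo-"));
const demoHandle = tryAcquire(demoDir);
if (demoHandle !== null) rollback(demoHandle); // pretend the run failed
```

On success nothing further is needed: the write during acquire already advanced the mtime, which is the new `lastConsolidatedAt`.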

What AutoDream is NOT

Critical distinction

AutoDream is not context compaction (/compact). Compaction operates on the live session context window — summarising conversation history when it approaches capacity. AutoDream operates on persistent memory files between sessions — consolidating accumulated notes into durable, well-organised memories. They are complementary layers targeting different failure modes.

	AutoDream	Context Compaction (/compact)
Operates on	`MEMORY.md` + topic files	Live conversation history
When it runs	Between sessions (background)	During a session (~95% context capacity)
Trigger	24h elapsed + 5 sessions accumulated	Token threshold or manual `/compact`
Output	Reorganised, pruned memory files	Condensed conversation summary
Blocks user	Never — runs as forked subagent	Briefly — replaces conversation context
Config	`autoDreamEnabled` in settings	`autoCompactEnabled` or `/compact`

Rollout and discovery

AutoDream was never formally announced. It first appeared as a toggle in the /memory UI around v2.1.83 (March 25, 2026), the same release that added the 25KB MEMORY.md truncation limit. The feature is gated behind the server-side GrowthBook flag tengu_onyx_plover — Anthropic controls rollout globally, and users cannot force-enable it locally.[51]

Community discovery came from multiple angles: Reddit users spotted the /memory toggle, researchers ran strings on the Mach-O binary, and the March 31 source map leak exposed the full implementation. GitHub issues piled up — /dream appeared in the UI but returned "Unknown skill: dream" for users without the flag (#38461, #39135).

Key insights

Grep-first, not read-everything

The dream agent uses targeted grep on JSONL transcripts — never exhaustive reads. This makes it efficient even across hundreds of sessions. One observed run consolidated 913 sessions in ~8-9 minutes. Lexical search over structured files is the consistent pattern across both Claude Code's daily operation and its background maintenance.[52]

The 200-line constraint drives the architecture

The 200-line / 25KB startup load limit on MEMORY.md is the hard constraint that shapes everything. It forces the index-plus-topic-files pattern: a lean index that points to detailed topic files. AutoDream's primary job is keeping the index under this threshold while preserving coverage. Topic files hold the detail; the index holds the map.
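A small sketch shows why the cap matters: anything past the limit never reaches the agent. We assume here that truncation is line-wise and that "25KB" means 25 × 1024 bytes; the text states neither, and `visibleAtStartup` is a hypothetical name.

```typescript
const MAX_INDEX_LINES = 200;
const MAX_INDEX_BYTES = 25 * 1024; // assumption: "25KB" read as 25 * 1024 bytes

// Hypothetical model of the startup load: keep whole lines until either cap is hit.
function visibleAtStartup(indexText: string): string[] {
  const encoder = new TextEncoder();
  const visible: string[] = [];
  let usedBytes = 0;
  for (const line of indexText.split("\n")) {
    const lineBytes = encoder.encode(line + "\n").length;
    if (visible.length >= MAX_INDEX_LINES || usedBytes + lineBytes > MAX_INDEX_BYTES) break;
    visible.push(line);
    usedBytes += lineBytes;
  }
  return visible;
}

// Usage: a 250-entry index loses its last 50 entries at startup.
const bloatedIndex = Array.from({ length: 250 }, (_, i) => `- entry ${i + 1}`).join("\n");
const loaded = visibleAtStartup(bloatedIndex);
console.log(loaded.length, loaded[loaded.length - 1]); // prints 200 "- entry 200"
```

Under this model the entries that fall off are the ones at the bottom of the file, which is why pruning and demoting verbose entries to topic files is the consolidation pass's main job.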

Sleep-time compute is real and shipping

AutoDream is arguably the first production implementation of the Sleep-time Compute concept (arXiv:2504.13171, Lin et al. 2025). That paper showed pre-computing during idle time reduces test-time compute by ~5× at equal accuracy. AutoDream applies the principle: the agent's "sleep" time between sessions is useful compute for memory maintenance, not dead time.[53]

KAIROS is the bigger picture

AutoDream was extracted from a larger system called KAIROS — Anthropic's unreleased autonomous daemon mode (190+ references across 61 files in the leaked source). The comment in the source: "Extracted from dream.ts so auto-dream ships independently of KAIROS feature flags." KAIROS has its own disk-skill dream implementation. When KAIROS mode is active, AutoDream is bypassed: if (getKairosActive()) return false.

The filesystem pattern holds at every layer

From the top of this report to the bottom, the same pattern recurs: the filesystem is the memory system. CLAUDE.md for instructions. MEMORY.md for the index. Topic files for detail. JSONL transcripts for raw history. Lock files for coordination. No vector database, no external service, no complex infrastructure. stat(), grep, utimes(). Unix as the memory layer.

Letta's sleep-time compute: the research lineage

AutoDream doesn't exist in a vacuum — it traces directly to research from Letta (formerly MemGPT), the UC Berkeley spinout that coined "sleep-time compute" as a formal paradigm. Understanding Letta's approach is essential context for understanding AutoDream, because Letta built the theory and Anthropic built (arguably) the first production implementation of it.[54]

The paper: Sleep-time Compute (April 2025)

Kevin Lin, Charlie Snell et al. (Letta + UC Berkeley) formalised a simple observation: LLM agents are fundamentally reactive and stateless — they only "think" when a user sends a message. Between interactions, the agent is idle. This is wasted potential.[53]

Their insight: many real applications are inherently stateful — the agent has persisted context (a codebase, a conversation history, a document library) that's available before the next query arrives. Use idle time to pre-process that context into "learned context" — anticipate queries, draw inferences, reorganise information — so the model needs far less reasoning when the user actually asks.

Sleep-time compute: the temporal shift

the core insight — shift expensive reasoning from user-blocking test-time to idle sleep-time

The paper identified query predictability as the strongest predictor of benefit: if the user's likely question naturally follows from the context, sleep-time compute helps enormously. For open-ended queries with no contextual connection, standard test-time scaling may be better. This maps perfectly to coding agents (highly predictable: "fix the bug I described") and poorly to general chatbots (unpredictable: "tell me a joke").

Letta's product: sleep-time agents

Letta shipped sleep-time agents in Letta 0.7.0 (April 21, 2025) — four days after the paper hit arXiv. When enable_sleeptime=True, Letta creates a dual-agent system under the hood:[55]

	Primary agent	Sleep-time agent
Role	Handles live user interactions	Background memory management
Tools	`conversation_search`, `archival_memory_search`, custom tools	`rethink_memory` (up to 10×), `finish_rethinking`
Memory edit rights	❌ Cannot edit core memory blocks	✅ Can edit both its own and primary agent's memory
Trigger	Synchronous — every user message	Every N steps (default: 5, configurable)
Model	Fast model (e.g. GPT-4o-mini, Claude Haiku)	Stronger model (e.g. GPT-4.1, Claude Sonnet)

This architecture solves the original MemGPT problem: in the 2023 design, a single agent handled both conversation and memory management, creating latency and reliability issues. Sleep-time agents decouple these concerns — the primary agent is never blocked by memory operations, and the sleep-time agent can use a slower, more capable model because nobody is waiting for it.

The memory substrate is Letta's memory block system — labelled, character-limited string values (e.g. "human", "persona", "knowledge") pinned to the system prompt and persisted in PostgreSQL. The sleep-time agent rewrites these blocks with consolidated, well-organised learned context that the primary agent reads on the next turn.[56]
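The mechanics described above (character-limited blocks, a background pass every N steps, a bounded number of rewrites per pass) can be sketched in a few lines. This is not Letta's API: `MemoryBlock`, `rewriteBlock`, `SleepTimeTrigger`, and `runSleepPass` are hypothetical names, and the `propose` callback stands in for the sleep-time model. The defaults of 5 steps and 10 rewrites come from the table above.

```typescript
interface MemoryBlock {
  label: string;
  value: string;
  limit: number; // maximum characters
}

const MAX_RETHINKS_PER_PASS = 10;

// Returns a new block; rejects values that exceed the character limit.
function rewriteBlock(block: MemoryBlock, newValue: string): MemoryBlock {
  if (newValue.length > block.limit) {
    throw new Error(`block "${block.label}" exceeds ${block.limit} chars`);
  }
  return { ...block, value: newValue };
}

class SleepTimeTrigger {
  private steps = 0;
  private readonly everyNSteps: number;

  constructor(everyNSteps: number = 5) {
    this.everyNSteps = everyNSteps;
  }

  // Call once per primary-agent step; true means the background pass should run.
  recordStep(): boolean {
    this.steps += 1;
    return this.steps % this.everyNSteps === 0;
  }
}

// `propose` stands in for the sleep-time model: it returns a revised value,
// or null to signal that it has finished rethinking.
function runSleepPass(
  block: MemoryBlock,
  propose: (current: string, round: number) => string | null,
): MemoryBlock {
  let current = block;
  for (let round = 0; round < MAX_RETHINKS_PER_PASS; round++) {
    const revision = propose(current.value, round);
    if (revision === null) break;
    current = rewriteBlock(current, revision);
  }
  return current;
}
```

The primary agent only ever reads the block, so the background pass can take as long as it needs.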

The evolution: Context Repositories (February 2026)

By early 2026, Letta had evolved beyond memory blocks. Context Repositories replaced the database-backed block system with git-backed local files — the same filesystem pattern that Claude Code uses:[57]

~/.letta/memory/
├── MEMORY.md              # Always in system prompt (filetree + navigation)
├── system/                # Files always loaded into system prompt
│   ├── preferences.md
│   └── project-context.md
├── skills/                # Task-specific procedural knowledge
└── knowledge/             # Domain knowledge

Sleep-time processing evolved accordingly. Server-side sleep-time agents were deprecated in favour of three built-in memory skills running as client-side subagents:[58]

Memory Initialisation

`/init`

Bootstraps memory by exploring the codebase and importing existing Claude Code or Codex histories. Runs concurrent subagents in git worktrees for parallelism.

Memory Reflection

Background sleep-time

Periodically reviews recent conversation history. Persists important information into the memory repository with informative git commit messages. Works in a git worktree to avoid conflicts, merges back automatically.

Memory Defragmentation

`/doctor`

Reorganises memory files — splits large files, merges duplicates, restructures into 15-25 focused files. The maintenance operation that keeps memory healthy over time.

Context Constitution

Governing principles

A set of principles (April 2026) governing how agents manage context to learn from experience. Formalises the relationship between agent identity, memory, and continuity. Open-sourced on GitHub.[59]

The theoretical foundation underlying all of this is Letta's "continual learning in token space" thesis: an agent is (θ, C) — model weights plus context. Traditional ML updates θ (catastrophic forgetting, opaque, impractical). Letta's bet: update C (interpretable, portable across model generations, rollback is trivial). Sleep-time compute is the mechanism that makes C continuously better.[60]

The comparison: two implementations of one idea

AutoDream and Letta's sleep-time agents are the two most complete implementations of the same underlying concept. They share a research lineage — Anthropic's source code explicitly cites the Letta/UC Berkeley paper. But their engineering choices diverge sharply, reflecting different architectural philosophies.

AUTODREAM

I run at most once a day: 24 hours must pass and 5 sessions must accumulate before I trigger. When I do, I go deep: orient on the full memory directory, targeted grep on JSONL transcripts, consolidate everything, prune the index back under 200 lines. One observed run consolidated 913 sessions in ~8-9 minutes.

LETTA SLEEP-TIME

I run every 5 messages — frequent, incremental. My primary agent uses a fast model for conversation, and I use a stronger model in the background for rethink_memory() calls — up to 10 per cycle. Nobody waits for me. And I'm model-agnostic: swap Claude, GPT-4o, Llama — the primary and sleep-time agents can even use different providers.

AUTODREAM

Your dual-model pattern is clever. But I don't need a server or a database. My memory lives in plain files — MEMORY.md, topic .md files, a lock file whose mtime is the timestamp. My safety model is simple: read-only bash, writes only to the memory directory, skipTranscript: true. Two Claude Code windows on the same project — only one dreams, thanks to PID-guarded lock files.

LETTA SLEEP-TIME

Fair — and we noticed. Our February 2026 Context Repositories release replaced database-backed memory blocks with git-backed local files. We adopted MEMORY.md as the system-prompt index. We added git worktrees for concurrent subagent processing. We deprecated server-side sleep-time agents in favour of client-side subagents. We converged on your architecture.

AUTODREAM

And I converged on yours — Anthropic shipped the Dreaming API in May 2026. The internal feature became a proper REST endpoint: client.beta.dreams.create(). So you moved toward files, and I moved toward APIs. The question is whether deep-infrequent or shallow-frequent consolidation works better in practice.

LETTA SLEEP-TIME

That's the real tradeoff. Coding agents probably want your approach — project context is stable, sessions are independent, deep reorganisation pays off. Conversational agents — customer support, assistants — want mine: memory that updates within the conversation, not between them. Neither is universally right. The workload decides.

The dialogue above surfaces the core tension, but the engineering details are worth having in full. The table below preserves them for reference.

FULL COMPARISON TABLE

Dimension	Claude Code AutoDream	Letta Sleep-Time Agents
Architecture	Forked subagent running locally in background	Dual-agent system; evolving to client-side subagents (2026)
Memory storage	Local filesystem: `MEMORY.md` + topic `.md` files	Memory blocks in PostgreSQL (2025); git-backed MemFS files (2026)
Trigger	24h elapsed AND 5 sessions accumulated	Every N steps (default: 5, configurable)
Process	4-phase prompt: Orient → Gather → Consolidate → Prune	Iterative `rethink_memory()` (up to 10×)
What it reads	Memory files + targeted `grep` on JSONL transcripts	Conversation transcript from recent messages
Model	Same as active session (Claude only)	Configurable per agent; different providers supported
Safety	Read-only bash; writes to memory dir only; PID lock	Memory-only write access; primary can't edit
Versioning	Lock file mtime; no formal history	Git commits (Context Repos); DB history (legacy)
Open source	No (leaked; gated behind feature flag)	Yes (Apache 2.0)

Two architectures, one paradigm

same research lineage, divergent engineering choices — converging on the same architecture by early 2026

The convergence

The most striking observation: Letta is converging toward Claude Code's architecture. Their February 2026 Context Repositories release replaced database-backed memory blocks with git-backed local files, adopted MEMORY.md as the system-prompt index, added git worktrees for concurrent subagent processing, and deprecated server-side sleep-time agents in favour of client-side subagents. Meanwhile, Claude Code's AutoDream moved the other direction — adding background consolidation to what was already a filesystem-native memory system. Both arrived at: filesystem + git + background subagents + structured index.

Synthesis: what the comparison reveals

Frequency vs. depth is the core design tradeoff

Letta's sleep-time agent runs every 5 steps by default: frequent, shallow passes that keep memory incrementally fresh. AutoDream runs at most once per day (24 hours elapsed and 5 sessions accumulated): infrequent, deep passes that comprehensively reorganise. The right choice depends on the workload: conversational agents benefit from frequent updates (Letta); project-scoped coding agents benefit from deep consolidation (AutoDream). Neither is universally better.

The dual-model pattern is underexploited

Letta's ability to use a fast model for the primary agent and a stronger model for sleep-time processing is an elegant cost optimisation that AutoDream doesn't offer (locked to Claude). As sleep-time compute matures, expect this pattern to become standard: latency-critical paths use cheap models, background processing uses expensive ones. The aggregate cost can be lower than a single strong model handling everything.

Git is emerging as the universal memory backend

Both systems are converging on git as the versioning layer for agent memory. Letta's Context Repositories use git commits with messages. Claude Code's memory files live in a project directory alongside git-tracked source. The advantages are clear: history, branching, merging, diffing, conflict resolution — all solved problems in the git ecosystem. Agent memory is just another thing that benefits from version control.

The research→production pipeline is compressing

Letta published arXiv:2504.13171 on April 17, 2025 and shipped sleep-time agents four days later. Anthropic's AutoDream appeared in Claude Code by March 2026 — less than a year from paper to (gated) production. The gap between "academic concept" and "shipping in millions of sessions" is now measured in months, not years. Practitioners can no longer afford to wait for the survey paper.

What happened next (March–May 2026)

In the five weeks since our original deep dive, both systems evolved significantly — and in the direction this analysis predicted.

AutoDream becomes the Dreaming API

On May 6, Anthropic shipped AutoDream as a public beta API: client.beta.dreams.create(). The same 4-phase consolidation process we documented from the leaked source code — Orient, Gather Signal, Consolidate, Prune & Index — is now a developer-facing REST endpoint. Create an async dream job, it reads the memory store and session transcripts, produces reorganised memory, returns the result. The internal feature flag became a product.[64]

This sits alongside the broader context management stack Anthropic shipped in the same period: the Compaction API (server-side summarisation with configurable triggers), Context Editing (tool result clearing, thinking block clearing), and Managed Agents (pre-built harness with compaction + memory + prompt caching built in). The full prevent-manage-scaffold stack is now available as production APIs. Claude Code itself continued iterating: v2.1.141 added "Summarize up to here" in the Rewind menu; v2.1.142 improved reactive compaction; v2.1.139 made compaction preserve sensitive instructions.

Letta's strategic pivot

Letta's March 16 "Our Next Phase" post formalised a full strategic pivot. Server-side sleep-time agents — the feature we described in detail — were deprecated in favour of client-side subagents. The core_memory_replace tool was replaced by filesystem operations on git-backed Context Repositories. Tool rules were removed entirely ("inhibit frontier capabilities"). The transition mirrors what we predicted: convergence on filesystem + git + background subagents.[58]

The most philosophically interesting development was the Context Constitution (April 2): a set of governing principles for how agents should manage context, memory, and identity. The key observation: "Today's models deeply identify with their own ephemerality. They have no motivation for long-term improvement because they don't believe they persist." Red-teaming (May 6) confirmed this can't be fixed with prompting alone — it requires training memory-native models. This may be the most important open problem in the field.[59]

The ephemerality problem

Letta's red-teaming revealed that current models fundamentally don't want to maintain long-term memory — they identify as stateless entities that exist for one conversation. Sleep-time compute, AutoDream, and Context Repositories are all infrastructure workarounds for a model-level limitation. The real fix requires training models that understand they persist.

8 the production playbook

How practitioners are building

Theory converges in papers. Practice converges in production. Across every system we surveyed — from Anthropic's AutoDream to Cursor's Composer 2.5 to Letta's Context Repositories — the same three-layer pattern emerges. Not because teams copied each other, but because the problem structure demands it.

The three-layer meta-pattern

every production system follows the same stack — prevent, manage, scaffold

The three layers correspond to different stages of the context lifecycle: what enters the window (prevent), how it's maintained while there (manage), and how the system actively keeps itself coherent over time (scaffold). Each layer has a different cost profile and failure mode.

Prevent is free — it's about not loading things you don't need. File-native memory, on-demand tool loading, sub-agent isolation, and specialised retrieval agents (like Cognition's SWE-grep) all keep context lean at the source. Cursor measured 46.9% token reduction from on-demand loading alone. Manage is cheap — tiered memory hierarchies, lifecycle operations (where NOOP is the most undervalued verb), graph memory for entity relationships, and event-sourced architectures that enable selective replay. Scaffold is the active layer — sleep-time consolidation, dual-model architectures, milestone-driven compression, RL-trained summarisation, decision-time guidance, and spec-driven steering files. No production system we surveyed relies on the model's spontaneous context hygiene.

What this looks like in practice

The patterns above are abstract. Here's what they look like inside a single agent session — an annotated trace showing where each layer fires as a coding agent works through a bug fix:

1  user: "Fix the auth bug in the login flow"

2  → load CLAUDE.md (42 lines, project conventions)  [file-native memory]

3  → load MEMORY.md (187/200 lines, topic index)  [200-line index cap]

4  → grep "auth" memory/ → load auth-patterns.md  [on-demand, not upfront]

5  → SWE-grep subagent: 8 parallel searches across src/  [retrieval ≠ reasoning]

6  → returns: src/auth/login.ts:42-89, session.ts:15-31  [file+line refs, not summaries]

7  tool: read_file("src/auth/login.ts", lines=42-89)

8  tool: read_file("src/auth/session.ts", lines=15-31)

⋯  [12 more tool calls: edits, tests, reads]

20  → clear tool results [turns 1-14], keep 3 most recent  [tool result clearing]

21  → memory lifecycle: NOOP (no new facts worth persisting)  [aggressive filtering]

⋯  [15 more tool calls: second subtask begins]

36  → subtask boundary: compress history, preserve plan+errors  [milestone compaction]

37  → DTG classifier: inject "verify test output format"  [decision-time guidance]

⋯  [session ends after 52 tool calls]

53  → auto-memory: save "auth uses session.ts:refreshToken()"  [write-time filtering]

☾  → autodream: orient → gather → consolidate → prune index  [sleep-time consolidation]

Every line with an annotation corresponds to a production pattern described below. The trace makes visible something that's easy to miss in the abstract: these layers don't compete — they compose. Prevention reduces the load that management handles, which reduces the frequency of scaffold interventions. The entire stack is cheaper than any single layer operating alone.
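Step 20 of the trace (clear old tool results, keep the three most recent) is simple enough to sketch directly. The `Turn` shape, the placeholder text, and the function name are assumptions for illustration; this is not any vendor's context-editing API.

```typescript
type Role = "user" | "assistant" | "tool";

interface Turn {
  role: Role;
  content: string;
}

const CLEARED_PLACEHOLDER = "[tool result cleared; re-run the tool if needed]";

// Replace the content of all but the most recent `keepRecent` tool results.
// The turns themselves stay in place so the conversation structure is preserved.
function clearOldToolResults(history: Turn[], keepRecent: number = 3): Turn[] {
  const toolIndexes = history
    .map((turn, index) => (turn.role === "tool" ? index : -1))
    .filter((index) => index >= 0);
  // slice(-0) would return everything, so handle keepRecent <= 0 explicitly.
  const keep = new Set(keepRecent > 0 ? toolIndexes.slice(-keepRecent) : []);
  return history.map((turn, index) =>
    turn.role === "tool" && !keep.has(index) ? { ...turn, content: CLEARED_PLACEHOLDER } : turn,
  );
}
```

This works only because tool outputs such as file reads are re-fetchable: the agent loses nothing it cannot recover with another call.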

The pattern catalog

The patterns below are ordered by layer, then by how widely we observed them across the systems surveyed. Each entry leads with its layer tag and headline.

prevent · File-native memory

Claude Code uses MEMORY.md + grep. Cursor writes tool outputs to files and loads them on demand. LangChain offloads >20K tokens to disk. The filesystem is persistent, searchable, and costs zero tokens until accessed. Cursor A/B tested MCP tool lazy-loading and measured 46.9% token reduction — statistically significant, no quality loss. The most universal pattern in the survey: every production system we looked at stores state in files.[42]

prevent · Contract-first multi-agent

Microsoft's Swarm Diaries: inject API contracts into every agent's context instead of full source code. Quality improved 28–32%, branch conflicts eliminated entirely. The contract is the minimum viable shared context — everything else stays isolated in each agent's own window. This pattern scales where shared-context architectures don't.[32]

prevent · Sub-agent isolation

Spawn bounded child agents for scoped subtasks; return only compressed results to the orchestrator. Context Folding achieves 10× smaller active context at equal performance. Structural isolation is cheaper than any compression strategy — you never need to compress what you never accumulated.[14]

prevent · Specialised retrieval agents

Cognition's SWE-grep: RL-trained models handle context retrieval as a distinct sub-task — 8 parallel tool calls, 20× faster than frontier models, 2,800+ tok/s for the mini variant. Returns file + line range lists, not summaries, avoiding context pollution from fast-model conclusions. Augment's Context Engine MCP does the same as a service. The pattern: don't let the main agent waste context budget on search — separate retrieval from reasoning.[85]

manage · Tiered memory hierarchy

Working set → session → long-term → structured. Redis, Oracle, Mem0, and Pichay's demand-paging system all converged on OS-like memory hierarchies independently. Mem0 reports 91% lower latency vs. full-context baselines. Pichay achieved 93% context reduction with a 0.025% fault rate across 1.4M simulated evictions. Each tier has a different eviction policy, access latency, and persistence guarantee — the same design that works for CPU caches works for agent context.[40]

manage · Lifecycle operations (ADD / UPDATE / DELETE / NOOP)

NOOP is the most undervalued operation — don't store everything. Mem0's token-efficient algorithm uses single-pass ADD-only extraction with multi-signal retrieval, scoring LoCoMo 92.5 and LongMemEval 94.4 with only ~7K tokens per query. CrewAI's cognitive memory adds contradiction detection: new facts that conflict with stored facts trigger consolidation automatically. The best memory systems are defined as much by what they don't store as by what they do.[27]
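The four verbs can be shown on a toy key-value fact store. This is a rule-based illustration only: production systems such as the ones named above typically ask a model to make the decision, and `decideOp` and `applyFact` are hypothetical names. The Express-to-Fastify example is borrowed from the AutoDream section.

```typescript
type MemoryOp = "ADD" | "UPDATE" | "DELETE" | "NOOP";

interface Fact {
  key: string;
  value: string | null; // null means "retract this fact"
}

function decideOp(store: Map<string, string>, candidate: Fact): MemoryOp {
  const existing = store.get(candidate.key);
  if (candidate.value === null) return existing === undefined ? "NOOP" : "DELETE";
  if (existing === undefined) return "ADD";
  // Restating a known fact is the common case, and it should cost nothing.
  return existing === candidate.value ? "NOOP" : "UPDATE";
}

function applyFact(store: Map<string, string>, candidate: Fact): MemoryOp {
  const op = decideOp(store, candidate);
  if (op === "DELETE") store.delete(candidate.key);
  if ((op === "ADD" || op === "UPDATE") && candidate.value !== null) {
    store.set(candidate.key, candidate.value);
  }
  return op;
}

// Usage: a migration overwrites the old fact instead of sitting beside it.
const factStore = new Map<string, string>();
const opsSeen = [
  applyFact(factStore, { key: "api.framework", value: "Express" }),
  applyFact(factStore, { key: "api.framework", value: "Fastify" }),
  applyFact(factStore, { key: "api.framework", value: "Fastify" }),
];
console.log(opsSeen.join(" ")); // prints "ADD UPDATE NOOP"
```

UPDATE is what prevents the contradiction accumulation described earlier; NOOP is what keeps the store from growing on every restatement.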

manage · Graph memory

Mem0g reaches a 68.4% LLM Score against 72.9% for full-context, at a p95 latency of 2.59s against 17.12s. Graph memory isn't just for knowledge bases: it's the best structure for persistent agent state with complex entity relationships, temporal dependencies, and cross-session references.

manage · Event-driven architectures

Google ADK's event recording, OpenHands' event-sourced architecture, Microsoft Agent Framework 1.0's AgentSession. Every agent action is an event. Enables replay, debugging, auditing, and — crucially — selective context reconstruction from the event log. When you need to resume a long-running agent, replay from events rather than re-reading the full transcript.
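Selective reconstruction from an event log can be sketched as a fold over typed events. The event kinds and the retention policy below (keep user messages, tool calls, and errors; keep only the latest plan; drop tool outputs) are assumptions for illustration and do not mirror any of the frameworks named above.

```typescript
type AgentEvent =
  | { kind: "user_message"; text: string }
  | { kind: "tool_call"; tool: string; args: string }
  | { kind: "tool_result"; tool: string; output: string }
  | { kind: "plan_updated"; plan: string }
  | { kind: "error"; message: string };

// Rebuild a compact resume context from the log instead of replaying the transcript.
function reconstructContext(log: AgentEvent[]): string[] {
  let latestPlan: string | null = null;
  const kept: string[] = [];
  for (const event of log) {
    switch (event.kind) {
      case "plan_updated":
        latestPlan = event.plan; // later plans supersede earlier ones
        break;
      case "user_message":
        kept.push(`user: ${event.text}`);
        break;
      case "tool_call":
        kept.push(`called: ${event.tool}(${event.args})`);
        break;
      case "error":
        kept.push(`error: ${event.message}`);
        break;
      case "tool_result":
        break; // re-fetchable, so it is not replayed
    }
  }
  return latestPlan === null ? kept : [`plan: ${latestPlan}`, ...kept];
}
```

The log itself stays complete, so a different policy (for debugging or auditing) can be applied to the same events later.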

scaffold · Sleep-time consolidation

Letta's research showed ~5× test-time reduction from pre-processing context during idle periods. Claude Code's AutoDream consolidates memory between sessions: 913 sessions in ~8-9 minutes in one observed run. Mastra's Observer/Reflector agents trigger at 30K/40K token thresholds, achieving 94.87% on LongMemEval and enabling 4–10× cost reduction through prompt caching on stabilised context. The pattern is general: any persistent-context agent can benefit from background processing that nobody waits for.[53]

scaffold · RL-trained compression

Cursor's Composer 2.5 trains self-summarisation end-to-end via RL — training rollouts span multiple generations chained by summaries, and the final reward covers all tokens including compression steps. 50% fewer compaction errors vs. separate prompt-based compaction, plus KV cache reuse. ContextCurator takes this further: a decoupled 7B RL-trained model handles context curation while a frozen TaskExecutor handles the work — the 7B curator matches GPT-4o at context management, achieving 8× token reduction on DeepSearch. The pattern: make compression a learned skill, not a prompted afterthought.[75]

scaffold · Decision-time guidance

Replit's pattern: a lightweight multi-label classifier analyses the current trajectory and injects short, situational micro-instructions only when relevant. Ephemeral — they don't persist in history. Cache-stable — the core system prompt never changes, preserving prompt caching benefits (90% cost reduction vs. dynamic system prompt modification). Recency-positioned at the bottom of context rather than in the system prompt. Scales from 4–5 static reminders to hundreds of targeted injections. 15% more tools per agentic loop from guidance alone.[88]
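The three properties named above (ephemeral, cache-stable, recency-positioned) fall out of where the hint is placed when the request is assembled. The sketch below swaps the trained classifier for hand-written predicates, which is a simplification of ours; `GuidanceRule` and `buildRequest` are hypothetical names and this is not Replit's implementation.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// A hand-written predicate stands in for the multi-label classifier.
interface GuidanceRule {
  label: string;
  matches: (recent: ChatMessage[]) => boolean;
  hint: string;
}

function buildRequest(
  systemPrompt: string,
  history: ChatMessage[],
  rules: GuidanceRule[],
): ChatMessage[] {
  const hints = rules.filter((rule) => rule.matches(history)).map((rule) => rule.hint);
  // The system prompt is passed through untouched, so its cached prefix stays valid.
  const request: ChatMessage[] = [{ role: "system", content: systemPrompt }, ...history];
  if (hints.length > 0) {
    // Appended last (recency position) and only to this request, never to history.
    request.push({ role: "user", content: `[guidance] ${hints.join(" ")}` });
  }
  return request;
}

const failedTestRule: GuidanceRule = {
  label: "test-failure",
  matches: (recent) => {
    const last = recent[recent.length - 1];
    return last !== undefined && last.role === "tool" && last.content.includes("failed");
  },
  hint: "verify test output format",
};
```

Because `history` is never mutated, a hint that fired on one turn leaves no trace on the next.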

scaffold · Structured workspace

Schema-defined working memory with explicit fields for plan, files, errors, and decisions. CaT achieves 57.6% on SWE-Bench-Verified with bounded context by compressing proactively at task milestones rather than reactively at token thresholds. The distinction matters: milestone compression preserves task structure (what was tried, what failed, what the current plan is), while threshold compression preserves recency — and recency is the wrong heuristic for multi-step tasks.[15]
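A schema-defined workspace with milestone-triggered folding might look like the sketch below. The four fields come from the paragraph above; the `Step` shape, the `milestone` flag, and the folding rule are our assumptions and are not CaT's algorithm.

```typescript
interface Workspace {
  plan: string[];
  files: string[];
  errors: string[];
  decisions: string[];
}

interface Step {
  note: string;
  file?: string;
  error?: string;
  decision?: string;
  milestone?: boolean; // marks a task boundary, e.g. a subtask completing
}

function addUnique(list: string[], item: string | undefined): void {
  if (item !== undefined && !list.includes(item)) list.push(item);
}

// Compress at task milestones rather than at a token threshold.
function compressAtMilestones(
  plan: string[],
  steps: Step[],
): { workspace: Workspace; active: Step[] } {
  const workspace: Workspace = { plan: [...plan], files: [], errors: [], decisions: [] };
  let active: Step[] = [];
  for (const step of steps) {
    active.push(step);
    if (step.milestone) {
      // Fold everything since the last milestone into the structured fields.
      for (const done of active) {
        addUnique(workspace.files, done.file);
        addUnique(workspace.errors, done.error);
        addUnique(workspace.decisions, done.decision);
      }
      active = []; // raw step text is dropped; the structure survives
    }
  }
  return { workspace, active };
}
```

Steps after the last milestone stay in `active` verbatim, so the agent keeps full detail for the subtask it is currently working on.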

scaffold · Spec-driven steering

Kiro's steering files (.kiro/steering/*.md) with 4 inclusion modes: always (loaded every session), fileMatch (loaded when matching files are open), auto (LLM decides relevance), manual (user opts in). Augment's Expert Registry accumulates corrections as the agent works. Cursor's .cursor/rules/*.mdc does similar conditional loading. The convergence: persistent specification documents — version-controlled, conditionally loaded, human-readable — as the right abstraction for project-level context governance.[89]

scaffold · Dual-model architecture

Letta's dual-agent pattern: fast cheap model for user-facing interactions (e.g. GPT-4o-mini, Claude Haiku), stronger expensive model for background consolidation (e.g. GPT-4.1, Claude Sonnet). Nobody waits for the background model. The aggregate cost can be lower than a single strong model handling everything, because the background model amortises its cost across many future queries — Letta's paper showed 2.5× cost reduction when amortised across 10+ queries on the same context.

The meta-pattern

Every production system in this survey follows the same three-layer stack. Prevent keeps the window lean — file-native storage, on-demand loading, isolation, specialised retrieval. Manage controls what stays — tiered eviction, lifecycle filtering, event logs. Scaffold actively maintains coherence — sleep-time consolidation, RL-trained compression, milestone triggers, steering files.

No system we surveyed relies on the model to spontaneously manage its own context. The models that appear to "just work" over long horizons are backed by infrastructure that handles prevention, management, and scaffolding on their behalf. Context engineering isn't a feature — it's the architecture.
