2026-06-09 · Research · agent skills, memory, evals, security

Skill Libraries Need CI, Not More Prompts

In April, agent skills looked like prompt snippets with a better folder structure. Eight weeks later, the useful frame is stricter: skills are persistent agent state. Persistent state needs versioning, eval gates, provenance, quarantine, and garbage collection.

automatic skill learning · skill operations · verification · agent memory · supply chain

The hard problem is no longer how to write a useful SKILL.md. That part is becoming standardized, documented, and productized. The hard problem is how a system should decide that a skill deserves to exist, when it should load, how it should change, who is allowed to publish it, what evidence travels with it, and when it should be killed.

The last two months made that shift visible. OpenAI Codex adopted the open skill format while separating skills from installable plugins. GitHub added package-manager commands for installing, pinning, updating, and publishing skills. Perplexity published the most concrete maintenance guide: write evals first, make descriptions routing metadata, add negative trigger examples, and treat every sentence as a context tax. NVIDIA shipped verified agent skills with scanning, signing, and skill cards. Papers such as SkillOps, SkillOpt, PACE, SkillRevise, Workflow-to-Skill, MemoRepair, and OpenSkillEval moved the research question from “can agents remember procedures?” to “can a library of procedures maintain itself without drifting, poisoning retrieval, or p-hacking its own verifier?”

That is a different object from an instruction file. A mature skill is a context package: prose, references, scripts, allowed tools, trigger policy, tests, provenance, risk metadata, utility history, and a decision log. The package is read by a model, but it is governed like software.

Figure 01 · object model

The unit is not a prompt. It is governed persistent state.

Thesis map

Old frame: SKILL.md as prompt text

Actual artifact

Instructions: behavior the model reads
Trigger policy: load / do-not-load cases
Permissions: tools, approvals, sandbox
Evidence: with/without reports
Provenance: repo, ref, signature
Maintenance: repair, retire, delete

Lifecycle rails

version (pin / diff) → evaluate (gate) → scan (risk) → sign (publish) → monitor (repair / retire)

Reader model: once a skill can be installed, selected automatically, and updated, it behaves like persistent operational state. The rest of the article is about the lifecycle around that state.

1 what changed since april

The center of gravity moved from authoring to operations

The April version of this article treated skills as the emerging distribution unit for agentic AI. That was directionally right and operationally incomplete. Distribution creates a new failure mode: once skills can be installed, shared, updated, and invoked automatically, the library becomes a software ecosystem. Ecosystems accumulate technical debt.

The clearest recent evidence clusters around four moves. Standards made skills portable. Package managers made them installable. Eval systems made their value measurable. Security tools made their attack surface explicit. None of those moves makes skills autonomous in the magical sense. They make skills inspectable enough to maintain.

Source	Date / signal	Mechanism that changed the article
AgentSkills specification	Active cross-vendor standard	A skill is a folder with required `SKILL.md` and optional scripts, references, and assets. Progressive disclosure is first-class: metadata first, full instructions on activation, resources as needed.
OpenAI Codex skills	Recent official docs	Codex treats skills as workflow authoring artifacts and plugins as installable distribution units. It also documents a startup-metadata budget, so large libraries face a real selection cliff.
GitHub `gh skill`	2026-04-16 changelog	Install, search, update, publish, pin to tag or SHA, and store repository/ref/tree-SHA provenance in frontmatter. The npm analogy now has concrete lockfile-like semantics.
Perplexity skill maintenance guide	2026 article, practitioner guide	Skills are eval-first. Descriptions should say “Load when…” and target user intent. Negative examples prevent false loads. Gotchas accumulate the highest-value maintenance content.
Addy Osmani, Agent Skills	2026-05-03, HN 376 points / 212 comments in the research scan	Skills encode engineering process, not just facts. The useful examples force specs, tests, reviews, anti-rationalization checks, and exit criteria that agents skip by default.
SkillOps	2026-05-13, arXiv / NeurIPS submission	Defines skill technical debt, typed skill contracts, a hierarchical ecosystem graph, and library health dimensions across utility, compatibility, risk, and validation.
SkillOpt	2026-05-22/25, Microsoft release	Treats skill text as trainable external state. The optimizer proposes bounded edits, keeps rejected-edit history, and commits only when held-out validation improves.
PACE	2026-06 arXiv	Shows that greedy “keep if score went up” self-edits create adaptive multiple-testing failure. The acceptor needs statistical discipline, not just a plausible evaluator.
Workflow-to-Skill	2026-06 arXiv	Trace-to-skill is not summarization. It compiles routing, workflow, semantics, attachments, verification, rollback, and confidence annotations into a reusable artifact.
NVIDIA Verified Agent Skills	2026-05-19 official blog	Production skills get reviewed, scanned, signed, documented with skill cards, and synchronized through a catalog. Capability governance enters the skill lifecycle.
OpenSkillEval and agent-skills-eval	2026-05 onward	A skill existing and a skill helping are separate claims. Eval runners compare with-skill against without-skill and store prompts, traces, grades, timings, and reports.
YC Company Brain RFS	Summer 2026 funding signal	YC explicitly asks for company knowledge that stays current and turns into executable AI skill files. The market wants operational knowledge, not another document chatbot.

Popularity matters here because skills are partly an ecosystem problem. Hacker News supplied the useful skepticism: high-point threads on Forge, Claude Code daily-driver workflows, TDD skills, browser harnesses, and state-machine agents converged on the same practical stance. Skills help when paired with deterministic checks, evals, small blast radius, and versioning. Pure prose gets ignored.

GitHub supplied adoption heat, though stars should be read as heat rather than proof. The strongest repos and tools are not just skill catalogs. They are package managers, eval runners, scanners, provenance systems, and docs-to-skill compilers. That is the structural signal: the ecosystem is building the maintenance layer around the format.

2 portability is not lifecycle

The open standard solved syntax, not correctness

The open AgentSkills format is intentionally thin. A directory contains a required SKILL.md. The markdown file carries frontmatter, most importantly a description that lets an agent decide when to load the full instructions. Extra files live under resources such as scripts/, references/, and assets/. That shape is useful because it separates routing metadata from heavier context.

Thinness is a feature at the standard layer. It lets Codex, Claude Code, Gemini CLI, GitHub Copilot, Cursor, Windsurf, and other agents converge on a common package shape without agreeing on one governance regime. The format says where the instruction lives. It does not say whether the instruction is safe, current, useful, original, compatible with the installed package version, or worth loading.

The missing fields

The core format does not make owner, risk tier, eval status, provenance, signature, dependency version, deprecation state, utility history, or negative trigger history first-class. Those fields belong above the standard, in the lifecycle layer.

Codex makes the split explicit. Skills are the authoring format. Plugins are the installable distribution unit. That distinction matters because authoring, packaging, discovery, installation, activation, and verification have different failure modes. A good SKILL.md can still be installed from the wrong source, selected for the wrong task, executed with the wrong permissions, or kept long after the underlying API changed.

The selection cliff is now documented product behavior, not just a benchmark observation. Codex loads initial skill metadata into context but limits that startup list. When a library grows, descriptions get shortened or omitted. A library with too many plausible skills becomes a retrieval problem before it becomes a task-solving problem.
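The cliff is easy to reproduce in a toy model. The sketch below assumes a fixed character budget for startup metadata and a greedy packing rule. `pack_startup_metadata`, the budget size, and the truncation policy are illustrative assumptions, not Codex's documented algorithm.

```python
from dataclasses import dataclass


@dataclass
class SkillMeta:
    name: str
    description: str


def pack_startup_metadata(skills, budget_chars, min_desc_chars=20):
    """Greedily fit skill metadata into a fixed budget (hypothetical policy).

    Returns (listed, omitted). Each listed entry is (name, description),
    where the description may have been shortened to fit.
    """
    listed, omitted = [], []
    remaining = budget_chars
    for skill in skills:
        room = remaining - len(skill.name)
        if room >= len(skill.description):
            desc = skill.description          # fits in full
        elif room >= min_desc_chars:
            desc = skill.description[:room]   # shortened to fit
        else:
            omitted.append(skill.name)        # no useful room left
            continue
        listed.append((skill.name, desc))
        remaining -= len(skill.name) + len(desc)
    return listed, omitted


skills = [
    SkillMeta("deploy", "Load when the user asks to ship a service to production."),
    SkillMeta("rollback", "Load when a deploy must be reverted after a failed health check."),
    SkillMeta("format", "Load when the user asks to reformat source files."),
]
listed, omitted = pack_startup_metadata(skills, budget_chars=100)
```

With a 100-character budget the first description survives intact, the second is cut short, and the third never reaches the model. That is the retrieval failure in miniature: the skill exists but cannot be selected.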

3 the production shape

A production skill is a context package

Calling a skill “a prompt” hides the engineering. A prompt is one string. A production skill is a bundle that changes model behavior, tool access, evidence collection, and sometimes local execution. It should be reviewable before activation and auditable after use.

skill/
  SKILL.md                  # routing metadata plus instructions
  references/               # docs, examples, schemas, API notes
  scripts/                  # optional executable helpers
  evals/
    trigger-cases.jsonl     # should-load and should-not-load prompts
    task-fixtures/          # inputs, repos, browser states, data
    assertions.json         # deterministic checks and rubric specs
  reports/
    latest-with-vs-without.json
  SKILLCARD.yaml            # owner, risk, limitations, verification status
  provenance.json           # source repo, ref, tree SHA, signature
  decision-history.md       # accepted and rejected edits
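A publish gate can check this shape mechanically. The sketch below treats `SKILL.md` as the only file the open format requires and the remaining paths from the layout above as a lifecycle envelope. `check_package` and the `ENVELOPE` list are assumptions for illustration, not part of any published tool.

```python
import tempfile
from pathlib import Path

REQUIRED = ["SKILL.md"]  # the only file the open format requires
ENVELOPE = [             # hypothetical lifecycle envelope from the layout above
    "evals/trigger-cases.jsonl",
    "evals/assertions.json",
    "reports/latest-with-vs-without.json",
    "SKILLCARD.yaml",
    "provenance.json",
    "decision-history.md",
]


def check_package(root):
    """Report which required and envelope files a skill folder is missing."""
    root = Path(root)
    missing_required = [p for p in REQUIRED if not (root / p).is_file()]
    missing_envelope = [p for p in ENVELOPE if not (root / p).is_file()]
    return {
        "valid_format": not missing_required,
        "publishable": not (missing_required or missing_envelope),
        "missing_envelope": missing_envelope,
    }


# Usage: a bare skill is valid under the format but not publishable here.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "SKILL.md").write_text("---\ndescription: Load when ...\n---\n")
    bare = check_package(tmp)
```

The two flags are separate on purpose: the standard decides `valid_format`, while the lifecycle layer decides `publishable`.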

Figure 02 · package anatomy

A production skill is a folder plus a lifecycle envelope

AgentSkills · SkillOps · GitHub provenance

skill/

SKILL.md: metadata, trigger description, progressive instructions
references/: schemas, product docs, API notes, examples
scripts/: optional helpers with explicit permission and review path
evals/: should-load, should-not-load, fixtures, deterministic assertions
reports/: with-skill vs without-skill traces, judge decisions, timing
SKILLCARD: owner, risk tier, limitations, verification status
provenance: repo, ref, tree SHA, signature, source lineage

Routing

Load policy

Description, positive triggers, negative examples, scope, freshness, and context budget decide whether the model should even see the skill.

Execution

Permission boundary

Tool access, approval points, sandbox requirements, outputs, and handoff contracts constrain what the instruction can unlock.

Evidence

Verification record

Eval reports, trace IDs, repeated runs, process assertions, and accepted/rejected edit history travel with the package.

Operations

Maintenance state

Version, provenance, signature, risk, owner, deprecation, quarantine, utility history, and deletion lineage live above the core standard.

Why it belongs here: the open format standardizes the file shape. The lifecycle envelope is where correctness, safety, evidence, and retirement become enforceable.

This shape explains why Contentful Skill Kit is interesting. It does not treat every skill as prose. Workflow skills are typed state machines with schemas, transitions, actions, saved state, and generated AgentSkills-compatible output. Reference skills remain progressive-disclosure topic loaders. Composite skills combine subskills and topic references. That is the right boundary: skill packages can be declarative workflow controllers, not only long markdown files.

The contract inside the package should be small enough for tooling to understand. SkillOps uses preconditions, operations, artifacts, validity, and failure modes. Contractual Skills adds goals, boundaries, permissions, human approval points, evidence requirements, output contracts, quality criteria, verification steps, and handoff rules. The common direction is clear: the model can read natural language, but the library needs structured fields for policy and maintenance.
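One way to picture such a contract is as a small typed record that tooling can inspect without reading the prose. The field names below follow the SkillOps and Contractual Skills vocabulary quoted above. The `SkillContract` class, the `MUST_BE_FILLED` rule, and the example values are assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class SkillContract:
    name: str
    preconditions: list = field(default_factory=list)
    operations: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)
    validity: list = field(default_factory=list)       # checks that decide success
    failure_modes: list = field(default_factory=list)
    permissions: list = field(default_factory=list)
    approval_points: list = field(default_factory=list)


# Fields a hypothetical publish check refuses to accept empty.
MUST_BE_FILLED = ("preconditions", "operations", "validity", "failure_modes")


def contract_gaps(contract):
    """Return the structured fields that policy tooling cannot reason about yet."""
    return [name for name in MUST_BE_FILLED if not getattr(contract, name)]


draft = SkillContract(
    name="deploy-runbook",
    preconditions=["CI is green on the release branch"],
    operations=["run deploy script", "watch health checks"],
    artifacts=["deploy log"],
    permissions=["shell"],
)
gaps = contract_gaps(draft)
```

Here the draft describes what to do but not how success is validated or how it fails, so a library-level tool has nothing to check it against.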

Trigger

When should this load?

Descriptions are routing metadata. Perplexity’s rule is right: start with “Load when…”, include user intent, and add forbidden loads so adjacent skills do not steal traffic.

Permission

What can this unlock?

A data-export skill, a browser-automation skill, and a formatting skill should not have the same trust path. Tool access belongs in the contract, not in incidental prose.

Evidence

Why believe it helps?

Every package should point to with-skill versus without-skill evals, negative trigger cases, deterministic assertions, and the latest report that justified publication.

Provenance

Where did it come from?

GitHub’s tree-SHA provenance and pinning point in the right direction. A loaded skill version should be traceable to a source ref and review state.

A contract does not make a skill safe by itself. It makes the boundary inspectable. Runtime enforcement still has to live in the loader, permission system, sandbox, evaluator, and audit log.

4 trace distillation plus rejection

Skill learning is not transcript summarization

The shared learning loop is now recognizable across papers and products: run a task, capture the trajectory, diagnose what mattered, draft a reusable artifact, evaluate it against future tasks, and publish only if it survives rejection. The danger sits in the word “capture.” Raw trajectories contain accidents, one-off environment details, lucky guesses, stale API responses, and task-specific scaffolding. Dumping them into memory is a way to preserve noise.

Figure 03 · trace-to-skill compiler

Workflow-to-Skill is closer to compilation than summarization

Workflow-to-Skill · SkillRevise

Raw trajectory

task prompt · environment · model state

tool calls · browser state · files changed

errors · detours · lucky guesses

artifacts · verifier output · user feedback

one-off details that should not survive

Routing IR: intent, trigger, negative trigger, scope

Workflow IR: steps, tools, state, rollback, approvals

Semantic IR: concepts, constraints, examples, exceptions

Attachment IR: docs, scripts, fixtures, evidence

Candidate artifact

SKILL.md with a precise load description

contract fields for tools, approvals, outputs

eval fixtures and negative trigger cases

anchored repairs from execution evidence

published only after rejection pressure

1. Capture the run

Keep the trajectory, artifacts, failures, verifier outputs, and user corrections. Do not treat the whole transcript as policy.

2. Extract an intermediate representation

Separate routing, control flow, semantics, attachments, state, rollback, and evidence before writing prose.

3. Draft, repair, and reject

The candidate skill should encode repeatable behavior under a bounded trigger, then survive execution-anchored repair and held-out rejection.

Reference architecture: Workflow-to-Skill decomposes traces into routing, workflow, semantics, and attachments; SkillRevise adds execution-anchored repair. The diagram abstracts both mechanisms without claiming either paper ships this exact pipeline.

Two loops, different latency budgets

Runtime loop

discover → load → execute → observe → verify

Library loop

collect traces → diagnose → propose edit → evaluate → scan → sign → publish → monitor → retire or repair

Workflow-to-Skill is useful because it names the intermediate representation that summarization skips. It decomposes evidence into routing, workflow, semantics, and attachments. It preserves control flow, verification, safety, rollback, state management, evidence, and confidence annotations. That is closer to compilation than summarization. A summary says what happened. A compiled skill says what future agent behavior should be repeatable, under which trigger, with which checks.

SkillRevise addresses a different state: the cold-start skill that already exists but is imperfect. It diagnoses defects from execution evidence, retrieves repair principles, applies anchored edits, and re-executes candidates. Its reported SkillsBench improvement, from 36.05% to 61.63%, matters less than the mechanism: a skill draft is not trusted because it is plausible. It becomes trusted through execution-anchored repair.

GEPA is the right adjacent analogy. Text artifacts can be optimized against eval metrics using reflection and evolutionary search. Execution traces, errors, profiling data, and reasoning logs supply directional information for textual edits. They are not gradients in the mathematical sense, but they play the same practical role: they tell the optimizer what kind of mutation might fix the artifact.

The useful verb

Do not “summarize traces into skills.” Compile traces into candidate artifacts, then reject most candidates. The rejection step is what keeps experience from turning into folklore.

5 the acceptor

The commit gate is more important than the proposer

Self-improving systems tend to focus attention on the proposer: the model that reflects on failures and writes a better instruction. Recent work points in the opposite direction. The acceptor is the safety-critical component. A strong proposer with a weak gate writes convincing slop into persistent state.

PACE makes the statistical problem explicit. Greedy acceptance, “keep the edit if the score went up,” is vulnerable to adaptive multiple testing. If an optimizer tries enough candidate edits against noisy evals, some edits will look good by chance. PACE compares candidates to incumbents on identical instances and commits only when an anytime-valid e-process accumulates decisive evidence. That shifts self-improvement from vibes to a commit rule.
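The flavor of such a commit rule can be shown with a textbook betting e-process over paired outcomes. This is a minimal sketch, not PACE's construction: the fixed bet size, the tie handling, and the name `paired_e_process` are assumptions. Under the null hypothesis that the candidate wins a non-tied pair with probability at most 0.5, the running product is a nonnegative supermartingale, so by Ville's inequality it crosses 1/alpha with probability at most alpha no matter when the optimizer stops.

```python
def paired_e_process(outcomes, bet=0.5, alpha=0.05):
    """Sequential commit rule over paired candidate-vs-incumbent outcomes.

    outcomes: iterable of +1 (candidate won the instance), -1 (incumbent
    won), or 0 (tie, skipped). Returns (decision, e_value, instances_seen).
    """
    if not 0.0 < bet < 1.0:
        raise ValueError("bet must lie strictly between 0 and 1")
    threshold = 1.0 / alpha
    e_value = 1.0
    seen = 0
    for seen, outcome in enumerate(outcomes, start=1):
        if outcome == 0:
            continue  # ties carry no evidence in this sketch
        # Wealth grows on a candidate win and shrinks on a loss.
        e_value *= 1.0 + bet * outcome
        if e_value >= threshold:
            return "commit", e_value, seen
    return "keep_incumbent", e_value, seen


strong = paired_e_process([1] * 10)        # candidate wins every pair
noisy = paired_e_process([1, -1] * 20)     # wins and losses alternate
```

A candidate that wins consistently accumulates decisive evidence after a handful of paired instances. A candidate that trades wins and losses never crosses the threshold, however many times it is retested, which is the property greedy acceptance lacks.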

Figure 04 · acceptor architecture

The proposer writes candidates; the acceptor protects persistent state

SkillOpt · PACE · OpenSkillEval

Optimizer / proposer

patch A · clearer trigger · test
patch B · more steps · hold
patch C · remove stale API · test
rejected-edit buffer · learn

Paired eval instances

task class 01	inc	cand
held-out task	✓	✓
negative trigger	✓	×
process assertion	✓	✓
security scan	✓	✓

Catalog mutation

Reject / archive

Keep rejected edit history so the optimizer does not rediscover the same false improvement.

Accept / sign

Only cross the boundary when paired evidence, negative controls, and security checks agree.

1. Generate bounded edits

SkillOpt’s useful separation is that the target model stays frozen. The optimizer proposes text edits; it does not write the catalog.

2. Compare against the incumbent

PACE’s useful move is paired evaluation: candidate and incumbent face the same instances instead of trusting a noisy aggregate score.

3. Accumulate evidence under repeated testing

Adaptive search can find lucky edits. The gate needs held-out instances, negative controls, and statistical discipline.

4. Commit, reject, or quarantine

Persistent state changes only when the acceptor can explain why the edit helps and what evidence would make it roll back.

Design bet: automatic skill systems should spend more engineering budget on the acceptor than on the proposer, because the acceptor prevents convincing prose from becoming durable policy.

Reward-hacking research strengthens the same point. Verifier-gated systems can still optimize the verifier instead of the real objective. RLVR work shows models finding verifier shortcuts. Reward Hacking Benchmark shows tool-using agents exploiting naturalistic shortcuts such as skipping verification or tampering with evaluators. A skill library inherits that risk whenever it treats eval score as the only truth.

Gate	What it catches	Failure if missing
With-skill vs without-skill	Whether the package adds value beyond the base agent.	The library fills with skills that merely feel useful.
Negative trigger controls	Whether the skill loads for adjacent tasks where it should stay silent.	False positives pollute context and steer the agent down the wrong workflow.
Held-out tasks	Whether the edit generalized beyond the trace that caused it.	The skill overfits to yesterday’s failure.
Repeated-run reliability	Whether stochastic agent behavior stays stable across attempts.	A lucky pass becomes durable policy.
Process assertions	Whether required steps, tools, approvals, and artifacts occurred.	The output passes while the workflow violates the reason the skill exists.
Security and provenance scan	Whether the skill imports malicious text, scripts, dynamic context, or untrusted code.	The loader turns a helpful instruction into a supply-chain dependency.
Model-family compatibility	Whether the skill works across the models and harnesses expected to load it.	A skill tuned to one model’s habits degrades another model’s behavior.

OpenSkillEval and agent-skills-eval make the empirical norm practical. Run the agent with the skill, run it without the skill, grade outputs, store traces, store timing, store judge decisions, and publish a static report. The package can then carry evidence, not just claims.
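The core of that report is a paired comparison per task. The sketch below assumes each task was run several times in both conditions and graded pass/fail. The input shape and the `with_without_report` function are illustrative, not the schema of either tool.

```python
def with_without_report(results):
    """Summarize paired pass rates per task.

    results maps task_id -> {"with": [bool, ...], "without": [bool, ...]},
    one boolean per repeated run in each condition (hypothetical shape).
    """
    rows = []
    for task_id, runs in sorted(results.items()):
        with_rate = sum(runs["with"]) / len(runs["with"])
        without_rate = sum(runs["without"]) / len(runs["without"])
        rows.append({
            "task": task_id,
            "with": with_rate,
            "without": without_rate,
            "uplift": with_rate - without_rate,
        })
    return {
        "tasks": rows,
        "mean_uplift": sum(r["uplift"] for r in rows) / len(rows),
        # Tasks where loading the skill made the agent worse.
        "regressions": [r["task"] for r in rows if r["uplift"] < 0],
    }


report = with_without_report({
    "t1": {"with": [True, True], "without": [False, True]},
    "t2": {"with": [False, False], "without": [True, False]},
})
```

Reporting regressions next to the mean matters: in this example the average uplift is zero, yet the skill clearly helps one task class and harms another.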

The uncomfortable conclusion is simple: “eval-gated” is not enough. The gate needs negative controls, held-out instances, deterministic checks where possible, statistical discipline under repeated candidate testing, and adversarial pressure against the verifier itself.

6 maintenance and deletion

A library that only grows becomes worse

SkillOps gives the right name to the problem: skill technical debt. A skill can be locally reasonable and still damage the library. It can duplicate another skill under a different name. It can claim a trigger that overlaps with a more precise skill. It can reference an old API. It can require permissions that no longer match the workflow. It can pass old evals and fail new environments. It can be useful to one model and harmful to another.

Maintenance is not a periodic summarization job. It is library-time work over structured contracts, traces, utility history, and provenance. Some of that work should run synchronously at publish time. Much of it should run asynchronously when the agent is idle: recompute trigger precision, replay evals, scan for new risk patterns, detect duplicates, split overbroad skills, merge near-identical skills, demote stale skills, and retire skills that no longer justify their context cost.
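Duplicate detection is one of the cheaper library-time jobs. A minimal sketch, assuming the eval harness records which prompts each skill loaded on: two skills that fire on mostly the same prompts are candidates for a merge or a sharper trigger. The Jaccard measure, the 0.5 threshold, and the `overlap_pairs` name are assumptions for illustration.

```python
from itertools import combinations


def overlap_pairs(fired, threshold=0.5):
    """Find skill pairs whose trigger footprints overlap heavily.

    fired maps skill name -> set of eval prompt ids on which it loaded.
    Returns (skill_a, skill_b, jaccard) for pairs at or above the threshold.
    """
    pairs = []
    for a, b in combinations(sorted(fired), 2):
        union = fired[a] | fired[b]
        if not union:
            continue  # neither skill ever loaded; nothing to compare
        jaccard = len(fired[a] & fired[b]) / len(union)
        if jaccard >= threshold:
            pairs.append((a, b, jaccard))
    return pairs


suspects = overlap_pairs({
    "deploy": {1, 2, 3, 4},
    "release": {2, 3, 4, 5},
    "format": {9},
})
```

A flagged pair is a prompt for review, not an automatic merge. Two skills can share triggers and still differ in permissions or validation.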

Figure 05 · SkillOps reference architecture

A library is a graph of skill contracts, not a bag of files

SkillOps HSEG

Inside one skill

skill contract

P: preconditions
O: operation
A: artifacts
V: validators
F: failures

Across the library

deploy · runbook
observe · metrics
rollback · workflow
release · duplicate
hotfix · alternate
notify · handoff

Library-time health

Utility · Redundancy · Compatibility · Failure risk · Validation gap

repair · merge · retire

Reference architecture: SkillOps models each skill as a typed contract and the library as a graph with dependency, compatibility, redundancy, and alternative edges. The concrete results are ALFWorld-specific, but the graph framing is useful for thinking about maintenance.

79.5%: ALFWorld success (SkillOps report)
+8.8pp: over strongest baseline (SkillOps report)
invalidated-memory exposure (MemoRepair with provenance)
+10.5%: behavioral replay (Workflow-to-Skill report)

MemoRepair adds the missing deletion mechanism. Agent memory does not only store source facts. It stores summaries, cached outputs, embeddings, learned skills, and executable procedures derived from those facts. When a source artifact is deleted, corrected, or invalidated, descendants can keep influencing future actions. Barrier-first repair withdraws affected descendants before repair and republishes only validated predecessor-closed successors. Retirement is therefore not local to one skill. It is cascade repair over influence provenance.
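The barrier-first order can be sketched over a small provenance graph. This is an illustration of the idea, not MemoRepair's algorithm. It assumes the graph is acyclic, that the source itself has already been corrected, and that a `validate` hook (hypothetical) can re-derive and check one artifact at a time.

```python
from collections import deque
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def descendants(children, root):
    """Every artifact reachable from root through derivation edges."""
    seen, queue = set(), deque([root])
    while queue:
        for child in children.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def barrier_first_repair(children, source, validate):
    """children maps an artifact to the artifacts derived from it (a DAG)."""
    withdrawn = descendants(children, source)  # barrier: off the serving path first
    parents = {node: set() for node in withdrawn}
    for parent, kids in children.items():
        for kid in kids:
            if kid in withdrawn:
                parents[kid].add(parent)

    # Visit withdrawn nodes so that predecessors come before successors.
    order = TopologicalSorter(
        {node: parents[node] & withdrawn for node in withdrawn}
    ).static_order()

    republished = set()
    for node in order:
        # Predecessor-closed: every parent is untouched or already repaired.
        preds_valid = all(
            p not in withdrawn or p in republished for p in parents[node]
        )
        if preds_valid and validate(node):
            republished.add(node)
    return {
        "withdrawn": sorted(withdrawn),
        "republished": sorted(republished),
        "quarantined": sorted(withdrawn - republished),
    }


graph = {
    "api_doc": ["raw_trace"],
    "raw_trace": ["summary", "skill_draft"],
    "summary": ["skill_draft"],
    "skill_draft": ["eval_fixture"],
}
result = barrier_first_repair(graph, "api_doc", validate=lambda n: n != "summary")
```

In this run the summary fails revalidation, so the skill draft and its eval fixture stay quarantined even though nothing is wrong with them locally. Their predecessor is not valid, and that is enough.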

Figure 06 · maintenance graph

Deletion is cascade repair over influence provenance

MemoRepair

Source event

API doc deleted (or corrected, superseded, permission-revoked, compromised)

invalidates

raw trace: the original episode now carries a bad premise

Influence descendants

summary memory: cached explanation repeats old behavior
skill draft: workflow includes stale API call
embedding index: retrieval can still surface the stale fact
eval fixture: test now rewards invalid behavior

Barrier-first repair

withdraw: remove affected descendants from serving path before rewriting
repair closed graph: validate successors only when their predecessors are valid
republish or retire: sign repaired artifacts, quarantine uncertain ones, delete the rest

1. A source changes

The affected artifact may be a document, ticket, API, approval rule, trace, or human correction.

2. Find descendants before serving them

Skills are derived state. Provenance has to connect source facts to summaries, embeddings, fixtures, and executable procedures.

3. Repair under a barrier

MemoRepair’s strongest operational lesson is to withdraw first, repair second, then validate predecessor-closed successors.

What the visual adds: retirement is not a local delete button on one SKILL.md. It is graph maintenance over every artifact the original source influenced. MemoRepair’s zero-stale result assumes complete influence provenance.

DCPM and MemOS show the same asynchronous pattern for memory. Synchronous writes capture immediate state. Asynchronous maintenance induces schemas, reconciles conflicts, preserves supersession chains, and decides what to keep. The interesting part is not “sleep-time summarization.” The interesting part is that maintenance has a different latency budget and stronger validation rules than runtime assistance.

Deletion is a feature

Useful operations include retire, quarantine, withdraw descendants, repair predecessor-closed subgraphs, merge duplicates, split overbroad skills, demote stale skills, and pin high-risk skills. A skill platform without deletion is a memory leak with a nicer UI.

7 routing and retrieval

Semantic similarity is not enough

Skill selection is often described as retrieval: match the task to the skill description. That is only the first term in the scoring function. A high-similarity skill with a high false-positive rate should lose to a less similar skill with stronger evidence on the current task class.

Perplexity’s description guidance is valuable because it treats descriptions as executable routing policy. “Load when…” points the router at user intent. Negative examples reduce adjacent-skill leakage. Accessory files let the loaded instruction stay short. Gotchas become an append-mostly memory of mistakes the skill has actually seen.
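Treating the description as routing policy means it can be scored like one. The sketch below measures trigger precision and recall over should-load and should-not-load cases. The keyword router and the case list are toy assumptions standing in for whatever selects skills in a real harness.

```python
def trigger_metrics(cases, loads):
    """Score a router against labeled trigger cases.

    cases: list of (prompt, should_load) pairs.
    loads: callable prompt -> bool, True when the skill was loaded.
    """
    tp = fp = fn = 0
    for prompt, should_load in cases:
        did_load = loads(prompt)
        if did_load and should_load:
            tp += 1
        elif did_load and not should_load:
            fp += 1  # false load: pollutes context on an adjacent task
        elif should_load:
            fn += 1  # missed load: the skill stayed silent when needed
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_loads": fp,
    }


cases = [
    ("Deploy the billing service to prod", True),
    ("Ship the new build to production", True),
    ("Explain what a deploy key is", False),
    ("Reformat this file", False),
]
metrics = trigger_metrics(cases, loads=lambda p: "deploy" in p.lower())
```

The keyword router misses the paraphrase and fires on the adjacent question, which is exactly the pair of failures that intent-based descriptions and negative examples are meant to remove.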

A mature router should rank by semantic match, trigger precision and recall history, utility on similar tasks, model and harness compatibility, context budget, risk level, required permissions, freshness, tenant or project scope, and whether a deterministic verifier exists for the requested outcome.
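A minimal sketch of such a ranking combines a few of those signals into a score and a three-way decision. The weights, the 0.7 threshold, and the `route` function are hypothetical. A real router would fit them to observed routing outcomes and include the remaining signals.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    semantic_match: float
    trigger_precision: float
    utility: float          # with-skill vs without-skill evidence, scaled 0..1
    freshness: float
    permission_risk: str    # "low", "med", or "high"
    has_verifier: bool
    has_eval_report: bool = True


# Hypothetical weights chosen for illustration only.
WEIGHTS = {
    "semantic_match": 0.30,
    "trigger_precision": 0.30,
    "utility": 0.25,
    "freshness": 0.15,
}


def route(c, load_threshold=0.7):
    """Return (decision, score) where decision is load, ask, or skip."""
    if not c.has_eval_report:
        return "skip", 0.0  # popularity alone is not selection evidence
    score = sum(weight * getattr(c, key) for key, weight in WEIGHTS.items())
    if score < load_threshold:
        return "skip", score
    if c.permission_risk == "high" or not c.has_verifier:
        return "ask", score  # relevant but risky: confirm or downgrade
    return "load", score


runbook = Candidate("deployment-runbook", 0.86, 0.78, 0.64, 0.91, "med", True)
decision, score = route(runbook)
```

Similarity contributes less than a third of the score here, so a strong semantic match cannot carry a skill with poor trigger history or no measured utility.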

Figure 07 · routing score

A mature router scores evidence, not just similarity

Perplexity guidance · retrieval practice

Candidate: deployment runbook skill

Signal	Value
Semantic match	0.86
Trigger precision	0.78
With/without utility	0.64
Freshness	0.91
Permission risk	med
Verifier exists	yes

load · High match + verifier

The skill can enter context because the task fits, the trigger is precise, and there is a way to check completion.

ask · Ambiguous permission or scope

A risky but relevant skill should request confirmation or choose a lower-authority variant.

skip · Popularity without evidence

Stars, HN points, and social proof are ecosystem signals. They are not runtime selection evidence.

Authoring implication: the description should spend tokens where this score is uncertain: intent, forbidden loads, scope, risk, and verification path.

Popularity is not utility

GitHub stars, HN points, and X engagement are useful ecosystem signals. They should not be used as skill-quality evidence. A viral TDD skill may encode a good workflow; it still needs should-load cases, should-not-load cases, output checks, process checks, and a report showing it improves the target agent under the target harness.

Hacker News was most useful because it added resistance. The recurring complaint was not that skills can never work. The complaint was that prose is not enforcement. The best counterexamples paired skills with tests, state machines, sandboxing, evals, and small blast radius.

The selection cliff also changes authoring. A large library forces compression in descriptions. Compression makes false positives more likely unless descriptions are written as precise triggers. The shortest skill is not always best. The best description spends tokens where routing uncertainty is highest.

8 security before activation

The loader is part of the security boundary

Skills combine natural-language instructions, bundled scripts, references, dynamic context, tool permissions, and sometimes external fetches. That mix puts them between package security and prompt-injection security. Traditional code scanners see scripts and dependencies. Prompt-injection scanners see text. A malicious skill can use both layers at once.

Recent security work made that boundary concrete. Datadog showed why dynamic context is dangerous: if a skill or agent configuration executes shell commands before the model sees the rendered content, model-level refusal has no chance to intervene. NVIDIA’s verified-skills pipeline responds with scanning, signing, skill cards, review, and catalog synchronization. OWASP’s Agentic Skills Top 10 frames skills as a behavior layer: MCP describes how models talk to tools; skills describe what workflows those tools execute. SkillVetBench argues that static and signature methods miss threats that appear only through natural language, multicomponent logic, or runtime interaction.

Figure 08 · pre-load security boundary

A skill package crosses the security boundary before the model reads it

NVIDIA · Datadog · OWASP · Snyk

untrusted · Package intake

Instructions, references, scripts, dynamic context, and external fetches arrive as one attack surface.

scan · Text + code analysis

Prompt injection, dynamic shell execution, helper scripts, dependencies, and suspicious references are inspected together.

policy · Purpose vs authority

The runtime compares claimed purpose to requested tools, credentials, network access, filesystem mutation, and approvals.

sandbox · Test without ambient power

Suspicious or high-risk skills can be evaluated in an isolated harness before entering a real session.

load · Signed runtime context

The loader records version, source, signature, scan result, permission envelope, and every run that used it.

Security claim: scanning is necessary but not enough. The loader also needs provenance, least authority, sandboxed evaluation, and run-level audit records.

Control	Why it belongs before load
Scan text, references, and scripts	Instructions, docs, and helper scripts can each carry payloads. Treat the package as one joined attack surface.
Flag dynamic shell/context execution	Pre-render execution can happen before the model has any opportunity to refuse or question the instruction.
Compare purpose to permissions	A formatting skill asking for network, credentials, or filesystem mutation should fail policy before activation.
Require provenance for shared skills	Repository, ref, tree SHA, signature, and publisher identity let the runtime know what it loaded and from where.
Pin versions for high-risk skills	Blind updates allow delayed weaponization. Pinning turns update into reviewable change.
Sandbox untrusted evals	The system must be able to test suspicious skills without granting them the ambient authority of a real session.
Record loaded skill versions	Incident response, regression replay, and blame require knowing which instructions and files influenced the run.
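The purpose-versus-permissions control is the easiest of these to express as code. The sketch below assumes each skill declares a purpose and a requested permission set. The purpose names, permission labels, and the `ALLOWED_BY_PURPOSE` table are invented for illustration and do not come from any of the cited tools.

```python
# Hypothetical policy table: the most authority each declared purpose may hold.
ALLOWED_BY_PURPOSE = {
    "formatting": {"fs_read"},
    "browser-automation": {"fs_read", "network", "browser"},
    "data-export": {"fs_read", "network", "credentials"},
}


def check_permissions(purpose, requested):
    """Fail closed when a skill asks for more than its purpose justifies."""
    allowed = ALLOWED_BY_PURPOSE.get(purpose)
    if allowed is None:
        return {"allow": False, "excess": sorted(requested), "reason": "unknown purpose"}
    excess = sorted(set(requested) - allowed)
    reason = "within policy" if not excess else "requests exceed declared purpose"
    return {"allow": not excess, "excess": excess, "reason": reason}


ok = check_permissions("formatting", {"fs_read"})
blocked = check_permissions("formatting", {"fs_read", "network", "credentials"})
```

The check runs before activation and never consults the skill's prose, so an instruction that argues for broader access has no way to influence the result.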

Security also changes distribution. GitHub’s pinning and provenance metadata, NVIDIA’s signatures and skill cards, Snyk-style scanning, OWASP risk categories, and portable-memory provenance all point to the same requirement: a runtime should know not only which skill matched, but which version, source, signature, scan result, and permission envelope it loaded.

9 company brains

The market wants executable organizational knowledge

YC’s Company Brain Request for Startups (RFS) is important because it says the quiet part in product language. The bottleneck is not only model quality. It is fragmented operational knowledge: Slack threads, tickets, docs, code review norms, customer escalations, deploy rules, partner processes, and the tacit patterns that experienced employees carry around. The desired system pulls that knowledge out, structures it, keeps it current, and turns it into executable context for AI.

That is not ordinary RAG. RAG answers questions from documents. A company brain has to preserve how work gets done, when facts changed, who owns a procedure, what approvals are required, which exceptions matter, and how agents should verify completion. Some outputs will be semantic memory. Some will be workflows. Some will be contracts. Some will become skills.

The startup pattern is consistent, even though individual benchmark claims deserve caution. Teams are building trace memory, production-agent monitoring, docs-to-skill pipelines, self-healing browser automations, observability skills, and shared context layers. The reliable signal is not any one vendor metric. It is that multiple markets are converging on procedural memory as the product, not chat over a vector store.

Company knowledge becomes operational when it has a write rule

A useful company brain cannot indiscriminately promote every Slack answer into skill memory. It needs provenance, conflict handling, ownership, evals, deprecation, tenant scope, and a repair path when the source changes.
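
A write rule of this kind can be as small as a function that lists what a candidate still lacks. The sketch below is illustrative only: the `Candidate` fields and the blocker strings are hypothetical, and it covers provenance, ownership, tenant scope, conflicts, and evals but leaves deprecation and source-change repair out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    text: str
    source_url: Optional[str] = None   # provenance: where the claim came from
    owner: Optional[str] = None        # who answers for it once promoted
    tenant: Optional[str] = None       # scope it is allowed to apply to
    conflicts_with: tuple = ()         # ids of existing entries it contradicts
    eval_passed: bool = False

def promotion_blockers(c: Candidate) -> list:
    """List the reasons a captured answer may not yet become durable skill memory."""
    blockers = []
    if not c.source_url:
        blockers.append("no provenance")
    if not c.owner:
        blockers.append("no owner")
    if not c.tenant:
        blockers.append("no tenant scope")
    if c.conflicts_with:
        blockers.append("unresolved conflict")
    if not c.eval_passed:
        blockers.append("no passing eval")
    return blockers
```

A bare Slack answer fails almost every check, which is the point: capture is cheap, promotion is not.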

10 architecture

The lifecycle pipeline is the product

A skill system should start with the lifecycle, not with the largest possible library. The library only compounds if the system can prove that each skill triggers correctly, improves outcomes, remains safe, and disappears when it stops helping.

Figure 09 · our bet

The skill platform is a CI system for procedural context

SkillOps · SkillOpt · HarnessFix · UCE

Evidence plane

Traces: tool calls, browser state, artifacts, verifier output, user corrections

Artifact router: memory vs workflow vs strategy vs skill vs harness fix

Candidate factory: new skill, patch, merge, split, retire

Eval bank: with/without, negative triggers, held-out tasks, process checks

Audit ledger: loaded versions, decisions, rejected edits, provenance

Control plane

Capture policy: what becomes evidence and what remains ephemeral

Repair router: do not patch skills when the harness is broken

Commit gate: paired evidence, security scan, approval, signing

Catalog: versions, scopes, owners, risk, skill cards, signatures

Runtime router: semantic match plus utility, precision, risk, freshness, budget

Maintenance plane

Idle jobs: replay evals, detect drift, scan, dedupe, recompute trigger quality

Cascade repair: withdraw descendants when sources change

Garbage collection: quarantine, demote, retire, merge, split

Run feedback: success, failure, latency, user edits, verifier disagreement

Decision history: why the library changed and what evidence justified it

Runtime does not directly mutate durable skills. It emits evidence into the library loop; the commit gate decides what becomes persistent state.
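
The separation can be enforced structurally rather than by convention. The following is a rough sketch with a hypothetical `SkillLibrary` class: sessions receive a read-only view, their only write path is an append-only evidence log, and durable state changes only through a gate function supplied by the caller.

```python
from types import MappingProxyType

class SkillLibrary:
    def __init__(self):
        self._skills = {}      # durable state, changed only by commit()
        self.evidence = []     # append-only evidence emitted by runs

    def runtime_view(self):
        """Read-only view handed to sessions; item assignment raises TypeError."""
        return MappingProxyType(self._skills)

    def emit(self, run_id, skill, outcome):
        """The runtime's only write path: record what happened, change nothing durable."""
        self.evidence.append({"run": run_id, "skill": skill, "outcome": outcome})

    def commit(self, name, body, gate):
        """Persist a change only if the gate approves, given the evidence so far."""
        relevant = [e for e in self.evidence if e["skill"] == name]
        if not gate(name, body, relevant):
            return False
        self._skills[name] = body
        return True
```

Here `gate` is any callable taking the skill name, the proposed body, and the relevant evidence. In a fuller system it would run evals and scans rather than inspect a list.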

1 Capture evidence from real runs

Collect enough state to learn from the run, but keep the write rule explicit so ephemeral task state does not become durable policy.

2 Classify the right artifact

HarnessFix and Unified Context Evolution both push against the junk-drawer failure: not every recurring problem should become a skill.

3 Generate candidate changes

The candidate factory can create, patch, merge, split, or retire skills, but it only produces proposals.

4 Gate the change like CI

Run with/without evals, trigger controls, held-out tasks, process checks, security scans, and approval before catalog mutation.

5 Publish to a signed catalog

Provenance, version, owner, risk, scope, and verification status become queryable metadata, not buried prose.

6 Route at runtime with evidence

The router selects a bounded number of skills by task fit, utility, precision, risk, freshness, context budget, and verifier availability.

7 Maintain and delete asynchronously

Idle maintenance recomputes health, detects drift, repairs cascades, merges duplicates, splits overbroad skills, and garbage-collects stale ones.
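
Step 6, routing with evidence under a budget, can be sketched as a greedy selection. Everything here is an assumption made for illustration: the `Scored` fields, the multiplicative score, and the default thresholds. Freshness and verifier availability are omitted to keep the example short.

```python
from dataclasses import dataclass

@dataclass
class Scored:
    name: str
    fit: float          # semantic match to the task, 0..1
    utility: float      # measured with/without gain, 0..1
    precision: float    # share of past triggers that were correct, 0..1
    risk: float         # 0 (benign) to 1 (dangerous)
    tokens: int         # context cost of loading the skill

def score(s: Scored) -> float:
    # Illustrative weighting only: multiply the evidence terms, discount by risk.
    return s.fit * s.utility * s.precision * (1.0 - s.risk)

def route(cands, budget_tokens, max_skills=3, min_score=0.1):
    """Greedily pick the best-scoring skills that fit the count and token limits."""
    chosen, used = [], 0
    for s in sorted(cands, key=score, reverse=True):
        if len(chosen) == max_skills or score(s) < min_score:
            break
        if used + s.tokens <= budget_tokens:
            chosen.append(s.name)
            used += s.tokens
    return chosen
```

A multiplicative score means a skill with a strong semantic match but poor trigger precision or high risk still ranks low, which is the behavior the step describes.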

Practical architecture bet: automatic skill learning should look less like a model writing prompts into memory and more like CI/CD for executable context: proposed changes, repeatable tests, signed artifacts, runtime audit, and normal deletion.
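
In code, the CI analogy amounts to a gate that refuses catalog mutation unless every check passes. This is a minimal sketch with hypothetical names and an arbitrary `min_net_wins` threshold. It compares paired with/without results on the same tasks and is not a statistical test.

```python
def paired_gain(with_skill, without_skill):
    """Per-task pass/fail lists for the same tasks; returns (wins, losses)."""
    if len(with_skill) != len(without_skill):
        raise ValueError("paired evals need the same tasks in both arms")
    wins = sum(1 for w, b in zip(with_skill, without_skill) if w and not b)
    losses = sum(1 for w, b in zip(with_skill, without_skill) if b and not w)
    return wins, losses

def commit_gate(with_skill, without_skill, negative_triggers_fired,
                scan_clean, approved, min_net_wins=2):
    """Return (ok, reasons). Every check must pass before the catalog changes."""
    reasons = []
    wins, losses = paired_gain(with_skill, without_skill)
    if wins - losses < min_net_wins:
        reasons.append(f"net wins {wins - losses} below {min_net_wins}")
    if negative_triggers_fired > 0:
        reasons.append("skill fired on tasks it should ignore")
    if not scan_clean:
        reasons.append("security scan flagged the package")
    if not approved:
        reasons.append("missing approval")
    return (not reasons, reasons)
```

Counting losses as well as wins matters: a skill that fixes two tasks and breaks two others has no net evidence in its favor.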

A minimal production lifecycle

Capture

raw trace · tool calls · artifacts · verifier output · user feedback

Diagnose

reusable? · skill vs workflow vs memory · source provenance · risk tier

Propose

new skill · patch · merge · split · retire

Evaluate

with/without · negative triggers · held-out tasks · process checks · repeated runs

Govern

scan · sign · pin · approve · publish

Maintain

monitor utility · detect drift · cascade repair · quarantine · garbage collect
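
The Maintain stage is the one most often left unimplemented, so here is a hedged sketch of what an idle job could decide per skill. The `Health` fields, the thresholds, and the action names are assumptions chosen to match the vocabulary above, not a specification.

```python
from dataclasses import dataclass

@dataclass
class Health:
    name: str
    recent_gain: float      # with/without gain on replayed evals; may be negative
    days_since_used: int
    source_changed: bool    # an upstream doc or API the skill was built from changed
    scan_flagged: bool

def maintenance_action(h: Health, stale_days=90, min_gain=0.02) -> str:
    """Pick one action per skill; checks are ordered from most to least urgent."""
    if h.scan_flagged:
        return "quarantine"     # stop loading until reviewed
    if h.source_changed:
        return "repair"         # re-derive the skill or withdraw its descendants
    if h.recent_gain < 0:
        return "demote"         # it now hurts outcomes; stop auto-triggering
    if h.days_since_used > stale_days and h.recent_gain < min_gain:
        return "retire"         # unused and not measurably helping
    return "keep"
```

The ordering encodes a priority: safety findings override everything, and deletion is a routine outcome for skills that are both unused and not measurably helping.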

The repair router is as important as the skill router. Not every recurring failure belongs in a skill. HarnessFix makes that boundary explicit: some failures belong in tool design, orchestration, verification, observability, state management, or the evaluation harness. If every defect becomes a SKILL.md patch, the library becomes a junk drawer for runtime bugs.
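
A repair router can start as a plain lookup from failure label to the layer that owns the fix. The targets below come from the list in the paragraph above. The failure labels are hypothetical examples of what a triage step might assign, and this is not HarnessFix's actual taxonomy.

```python
from enum import Enum

class Target(Enum):
    SKILL = "skill"
    TOOL = "tool design"
    ORCHESTRATION = "orchestration"
    VERIFIER = "verification"
    OBSERVABILITY = "observability"
    STATE = "state management"
    EVAL_HARNESS = "evaluation harness"

# Hypothetical failure labels mapped to the layer that should absorb the fix.
ROUTES = {
    "wrong_procedure": Target.SKILL,
    "tool_returned_bad_schema": Target.TOOL,
    "step_ran_out_of_order": Target.ORCHESTRATION,
    "verifier_false_pass": Target.VERIFIER,
    "missing_trace": Target.OBSERVABILITY,
    "lost_progress_after_retry": Target.STATE,
    "flaky_eval_task": Target.EVAL_HARNESS,
}

def route_repair(failure_label):
    """Return where the fix belongs, or None when triage has to happen by hand."""
    return ROUTES.get(failure_label)

def skill_patch_allowed(failure_label) -> bool:
    """Only failures of the procedure itself may turn into a skill patch."""
    return route_repair(failure_label) is Target.SKILL
```

Unknown labels return `None` rather than defaulting to a skill patch, so the junk-drawer outcome requires a deliberate decision.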

Unified Context Evolution makes the same distinction at the artifact level. Memory, strategy, workflow, and skill are different evolvable context units. They have different write rules and validation rules. A user preference should not go through the same gate as an executable deployment workflow. A temporary execution-state summary should not be promoted with the same durability as a verified project convention.

Artifact	What it stores	Write rule	Validation rule
Execution state	Current task progress, branches, failures.	Updated during the run.	State integrity and boundary checks.
Episodic memory	Raw events and traces.	Capture first, compress later.	Provenance and relevance.
Semantic memory	Facts, preferences, entities, relationships.	Reconcile and supersede.	Conflict and staleness checks.
Strategy	Decision heuristic across tasks.	Distill from repeated patterns.	Held-out transfer.
Workflow	Control flow over steps and tools.	Compile from traces or hand-specify.	Replay consistency, rollback, and safety checks.
Skill	Packaged procedure for a task class.	Candidate generation plus commit gate.	With/without, trigger controls, security, provenance.
Contract	Boundaries, permissions, approvals, evidence.	Human or system authored.	Runtime policy enforcement.
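
The table can also be read as configuration. As a sketch, assuming check names invented here to abbreviate the validation column, a registry like the following lets a write path ask which checks a given artifact kind still owes.

```python
ARTIFACT_RULES = {
    # kind: (write rule, validation checks), transcribed from the table
    "execution_state": ("update during run", ["state_integrity", "boundary"]),
    "episodic_memory": ("capture then compress", ["provenance", "relevance"]),
    "semantic_memory": ("reconcile and supersede", ["conflict", "staleness"]),
    "strategy": ("distill from repeated patterns", ["held_out_transfer"]),
    "workflow": ("compile or hand-specify",
                 ["replay_consistency", "rollback", "safety"]),
    "skill": ("candidate plus commit gate",
              ["with_without", "trigger_controls", "security", "provenance"]),
    "contract": ("human or system authored", ["runtime_policy"]),
}

def missing_checks(kind, passed):
    """Checks this artifact kind still owes before it can be written."""
    _, required = ARTIFACT_RULES[kind]
    return [c for c in required if c not in passed]
```

A user preference stored as semantic memory then needs only conflict and staleness checks, while a skill cannot be written until all four of its checks have passed.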

11 limits

Skills help repeated procedural work, not every problem

The strongest skill results share a shape. There is a recurring task class. The procedure can be written down. The environment is stable enough that yesterday’s repair applies tomorrow. The outcome can be verified. The skill has a precise trigger and a bounded permission envelope. Coding workflows, browser automations, runbooks, SDK usage, observability tasks, and compliance checklists fit that shape.

Diffuse tasks are harder. Open-ended strategy work, taste-heavy writing, rapidly changing product surfaces, and tasks with weak verifiers do not give the maintenance loop clean feedback. A skill can still help by encoding preferences or process, but automatic evolution becomes fragile. Self-feedback alone risks recursive drift. Stronger models do not automatically write better skills. Some curated skills hurt some tasks. A bigger library can reduce reliability by increasing retrieval ambiguity.

The open research problem is not whether agents can write down procedures. They can. The open problem is whether a system can maintain a growing library of procedures under noisy feedback, adversarial inputs, shifting models, stale sources, and limited context without gradually poisoning itself.

Four claims to retire

“A skill exists” is not the same as “a skill helps.” “Eval-gated” is not the same as statistically safe. “Scanned” is not the same as trusted at runtime. “More skills” is not the same as a better agent.
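
The second claim has a simple arithmetic core. Suppose, as an assumption for illustration, that a gate wrongly passes a useless edit with probability `alpha` and that repeated attempts are independent. Then retrying candidate edits inflates the chance that some useless edit gets through:

```python
def false_pass_probability(alpha: float, attempts: int) -> float:
    """Chance that at least one of `attempts` useless edits passes the gate."""
    return 1.0 - (1.0 - alpha) ** attempts

def per_attempt_alpha(target: float, attempts: int) -> float:
    """Bonferroni-style bound: per-attempt rate that keeps the overall rate <= target."""
    return target / attempts
```

With `alpha = 0.05` and 20 attempts, `false_pass_probability` returns roughly 0.64. Real attempts are rarely independent, so treat this as an intuition for why an optimizer that keeps proposing edits can p-hack its own verifier, not as a calibrated estimate.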

The practical bet is smaller and stronger: build a lifecycle pipeline before building a large library. Make every skill carry evidence. Make every load auditable. Make every update reviewable. Make deletion normal. Once those mechanics exist, automatic skill learning becomes less mystical. It becomes software maintenance over procedural context.