2026-05-18 10:59 UTC | In-site rewrite | 6 min read | Updated: 2026-06-27 00:25 UTC

Agent Skills Work but the Research Shows Most Teams Are Building Them Wrong

Recent research reveals the real effectiveness of agent skills: curated skills improve task completion by 16.2% on average, while self-generated skills show no consistent benefit. As skill libraries grow, flat retrieval breaks down, making hierarchical organization critical. Additionally, more than one in four community skills contain exploitable vulnerabilities.

Source: O'Reilly AI & ML Radar | Authors: Aishwarya Naresh Reganti, Prahitha Movva, and Kiriti Badam

This post was originally published on The Nuanced Perspective and is being reposted here with the authors’ permission.

Agent skills are everywhere right now. Atlassian built them into Rovo so agents can automatically triage Jira tickets, draft Confluence pages, and route service requests without anyone typing a prompt. Canva and Figma use them so Claude can interact with design files directly. Stripe published skills for payment workflow automation. When Anthropic launched the Agent Skills open standard in December 2025, Microsoft adopted it in VS Code and GitHub within weeks.

The idea is elegantly simple. Instead of building a new specialized agent for every use case, you write a skill once, and any agent that understands the standard can use it. A code reviewer, a PR generator, a deployment checklist, a sprint planner. Each lives in a folder, triggers when relevant, and brings your team’s specific way of doing things into the agent’s context.

But the research on whether skills actually work, and what causes them to fail, is only now catching up to adoption. Four recent papers take the first systematic look at skills in practice: what the benchmarks show, how libraries break down as they grow, and what a more principled approach to orchestration looks like.

Three findings that will change how you think about skills:

Curated skills raised the rate at which agents successfully completed tasks by 16.2% on average across 84 tasks. Model-written skills showed no consistent benefit across any configuration tested.

As skill libraries grow, the agent’s ability to find the right skill on demand breaks down. When it scans every skill description in one pass, similar-sounding skills start colliding. Organizing skills into a hierarchy rather than a flat list is what the research shows actually fixes this.

A large-scale security study of ~31K community skills found that more than one in four contain exploitable vulnerabilities, spanning prompt injection, data exfiltration, and privilege escalation.

This is what those papers found, and what it means for anyone building with skills today.

What a skill is

Your team has a specific way of reviewing PRs. Particular checks, a specific order, standards that go beyond what any generic reviewer would know. You’ve explained it to every new engineer who joined. A skill is how you stop explaining it and let the agent carry it instead. In practice it’s a folder with a SKILL.md file at the center: a description that acts as the trigger condition, a body with step-by-step instructions, and optionally scripts and reference documents that load only when needed. In short, a skill is a scoped set of tools and instructions the agent can invoke.

At session startup, the agent reads only the name and description from each installed skill, which is about 100 tokens per skill. The full instructions load only when the skill activates, and scripts run without being read into context at all. A large skill library costs almost nothing at initialization. The context budget only gets spent when a skill is actually running.
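The loading behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not the standard's reference loader: the `SkillIndex` and `SkillEntry` classes, the `split_skill` helper, and the sample `pr-review` skill are all hypothetical, and the frontmatter is parsed by hand on the assumption that it holds simple `key: value` lines.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical SKILL.md content, made up for illustration.
SKILL_MD = """---
name: pr-review
description: Review pull requests against the team style guide.
---
1. Load the style guide.
2. Flag violations and blockers.
"""


def split_skill(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (frontmatter dict, instruction body)."""
    _, header, body = text.split("---\n", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body


@dataclass
class SkillEntry:
    name: str
    description: str
    source: str                  # raw file text; stands in for a path on disk
    body: Optional[str] = None   # stays None until the skill activates


class SkillIndex:
    def __init__(self) -> None:
        self.entries: dict[str, SkillEntry] = {}

    def register(self, source: str) -> None:
        """Startup: index only the name and description."""
        meta, _ = split_skill(source)
        self.entries[meta["name"]] = SkillEntry(meta["name"], meta["description"], source)

    def startup_context(self) -> str:
        """What the agent sees at session start: one short line per skill."""
        return "\n".join(f"{e.name}: {e.description}" for e in self.entries.values())

    def activate(self, name: str) -> str:
        """Activation: pull the full instructions into context on demand."""
        entry = self.entries[name]
        if entry.body is None:
            _, entry.body = split_skill(entry.source)
        return entry.body


index = SkillIndex()
index.register(SKILL_MD)
print(index.startup_context())
print(index.activate("pr-review"))
```

Here `source` holds the raw text in memory to keep the sketch self-contained; a real loader would more likely store a file path and read the file only on activation.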

That’s progressive disclosure, and it’s what makes skills different from system prompts, which load everything globally every session, and from tools, which are API calls that give the agent direct capabilities. The same distinction holds for MCP: MCP gives the agent abilities, say, a shell, an API connection, or access to a database, whereas skills encode the knowledge of how to use those abilities well for a specific workflow. Block’s engineering team put it well: skills are like GitHub Actions YAML, and MCP is the runner. One describes the workflow and the other makes it possible.

Some concrete examples of what this looks like in practice, from teams that have shipped skills in production:

A PR review skill that loads your org’s specific style guide, flagging violations and blockers according to your team’s standards rather than generic best practices

A deployment checklist skill that runs your team’s exact predeploy sequence, covering environment checks, rollback verification, and the three Slack channels to notify in order

A data reporting skill that knows your company’s metric definitions, so when someone asks for “revenue,” it pulls the right number rather than the closest approximation

A sprint planning skill that fetches the backlog, applies your team’s capacity rules, and proposes a plan structured the way your team runs standups

The value in each of these isn’t the task itself. Any agent can attempt a PR review or a sprint plan. The value is the organizational knowledge baked into how the skill executes it: your style rules, your deploy sequence, your metric definitions, your team’s way of running things. That specificity is also what makes skills hard to get right, as the benchmarks show.

What the benchmarks show

SkillsBench is the first benchmark built specifically to measure whether agent skills actually improve performance. It tested 84 tasks across 11 domains, running each task under three conditions: no skill, a curated skill, and a self-generated skill. The results are worth sitting with.

Curated skills raised average pass rates by 16.2 percentage points. However, the gains were uneven across domains. Software engineering tasks improved by 4.5 points, while healthcare tasks improved by nearly 52 points. The domains where skills helped most were the ones with highly structured workflows and domain-specific conventions the base model doesn’t carry natively.
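An overall average can hide how uneven the per-domain gains are. The sketch below shows one way to break a pass-rate delta down by domain. The records are toy values invented for illustration, not SkillsBench data, and the function names and the `no_skill`/`curated` labels are assumptions.

```python
from collections import defaultdict

# Toy pass/fail records, NOT SkillsBench data: (domain, condition, passed).
RESULTS = [
    ("software", "no_skill", True), ("software", "no_skill", False),
    ("software", "curated", True), ("software", "curated", False),
    ("healthcare", "no_skill", False), ("healthcare", "no_skill", False),
    ("healthcare", "curated", True), ("healthcare", "curated", False),
]


def pass_rates(results):
    """Map (domain, condition) to the fraction of runs that passed."""
    counts = defaultdict(lambda: [0, 0])  # [passed, total]
    for domain, condition, passed in results:
        counts[(domain, condition)][0] += int(passed)
        counts[(domain, condition)][1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def domain_deltas(results, baseline="no_skill", treatment="curated"):
    """Per-domain gain of treatment over baseline, in percentage points."""
    rates = pass_rates(results)
    domains = sorted({domain for domain, _ in rates})
    return {
        d: round(100 * (rates[(d, treatment)] - rates[(d, baseline)]), 1)
        for d in domains
    }


deltas = domain_deltas(RESULTS)
print(deltas)                               # per-domain gains
print(sum(deltas.values()) / len(deltas))   # the average that hides the spread
```

In this toy data one domain gains nothing and the other gains 50 points, yet the average alone would suggest a uniform 25-point lift.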

The less-cited result is that self-generated skills, where the model writes its own skill rather than a human curating one, provided no average benefit across configurations (“SkillsBench,” Table 3). Some model configurations saw small gains; others saw small losses. The paper’s conclusion was that models cannot reliably author the procedural knowledge they benefit from consuming. The trajectory analysis in the benchmark identified two failure modes:

Models generate imprecise procedures that lack specific API patterns.

Models fail to recognize what domain knowledge the task actually requires.

The benchmark’s self-generation condition has also drawn pushback from practitioners. One engineer writing on HackerNoon argues the test doesn’t reflect how skilled teams actually build skills. The benchmark prompted a fresh agent to write a skill and immediately use it, which is closer to asking a model to think harder before attempting a task than to building a skill from real execution experience. His own replication, using skills built from actual debugging sessions, showed much stronger results. The distinction matters because a skill captures what a fresh model wouldn’t know. If the model could have reasoned its way there anyway, the skill wasn’t needed.

This matters in practice because self-generation is the obvious shortcut. You finish a workflow, ask the agent to extract it as a skill, and move on. The benchmark says that without a human review step, you’re not getting the gains you’d expect. The skills look complete. They often cover the main path. What they miss are the edge cases, the exceptions, the three things your team does differently that the model has no way of knowing, and those are exactly the things that make a skill valuable.

One finding worth noting for anyone building with skills: focused skills with two to three modules consistently outperformed comprehensive documentation (“SkillsBench,” Section 4.2). More coverage in a single skill didn’t help; more focused, well-scoped skills did. The benchmark also found that smaller models running with curated skills could match larger models running without them, which has meaningful cost implications for anyone running skills at scale (“SkillsBench,” Section 4.2.3, Finding 7).

Questions that come up when building with skills

These questions show up every time a team starts building a skill library.

When does something become a skill versus staying in a workflow or system prompt? The cleanest test is whether this is a recurring task that your team has a specific, repeatable way of doing. If yes, it’s a skill candidate. If it’s a one-time flow or something where general reasoning is sufficient, it probably doesn’t need one. The key difference between a skill and a workflow tool like n8n is flexibility. A workflow executes a fixed sequence and breaks when inputs change, while a skill gives the agent procedural guidance it can apply to variations of the same task. Similarly, agentic workflows can chain multiple agents and tasks together, but each agent still benefits from skills that encode the org-specific knowledge for its part of the chain. When you want the “what” to be consistent but want the agent to handle the “how” intelligently, that’s a skill.

How narrow or broad should a skill be? The SkillsBench finding that focused skills with two to three modules outperform comprehensive ones is directly relevant here (“SkillsBench,” Section 4.2). A skill that tries to cover an entire domain tends to underperform one that handles a specific thing well. The more practical question is whether to put a full workflow (data fetch, format, generate PDF) into one skill or split it. Current research supports splitting because each piece then becomes reusable, easier to update when something changes, and less likely to create unexpected behavior when one module’s scope drifts.

What about skills for noncoders or nonsoftware workflows? Skills are format-agnostic. They’re structured instructions plus optional scripts, and the domain can be anything. A customer support team can encode their escalation criteria, tone guidelines, and the specific conditions where a human always takes over. A legal team can encode their document review checklist. A design team can encode component standards so reviews stay consistent across contributors. Atlassian’s Rovo agents are a useful reference outside the coding context. Their skills handle ticket triage, Confluence page creation, and service request routing, none of which is software engineering.

When should you deprecate a skill? This is the question that gets skipped most often. The “SoK” paper argues for treating skills like any other maintained artifact through discovery, refinement, evaluation, update, and eventually deprecation (see Figure 2 in the paper). A skill that was compensating for a model capability gap six months ago may now be redundant, and worse than redundant if it’s overriding better native behavior. The practical test is to run the task with and without the skill and check if the skill still helps. If the gap has closed, retire it.
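The with-and-without test can be expressed as a small check. This is a sketch under stated assumptions: `run_task` is a hypothetical callable standing in for a real agent run, `fake_runner` is a stub so the example executes, and the `min_gain` threshold of 0.05 is an arbitrary illustrative value, not a figure from the paper.

```python
from typing import Callable, Sequence

# Hypothetical signature: run_task(task, use_skill) returns True when the task passes.
TaskRunner = Callable[[str, bool], bool]


def pass_rate(run_task: TaskRunner, tasks: Sequence[str], use_skill: bool) -> float:
    """Fraction of tasks that pass under one condition (tasks must be non-empty)."""
    return sum(run_task(task, use_skill) for task in tasks) / len(tasks)


def should_retire(run_task: TaskRunner, tasks: Sequence[str], min_gain: float = 0.05) -> bool:
    """Retire the skill when it no longer beats the no-skill baseline by min_gain."""
    gain = pass_rate(run_task, tasks, True) - pass_rate(run_task, tasks, False)
    return gain < min_gain


def fake_runner(task: str, use_skill: bool) -> bool:
    """Stub for a real agent run: only "edge-case" needs the skill to pass."""
    return use_skill or task != "edge-case"


print(should_retire(fake_runner, ["happy-path", "edge-case"]))  # skill still helps
print(should_retire(fake_runner, ["happy-path"]))               # gap has closed
```

Because a negative gain also falls below the threshold, the same check flags a skill that has started overriding better native behavior.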

What breaks as the library grows

A single well-written skill works well. As libraries grow, flat retrieval breaks down, and the “AgentSkillOS” paper is the first to study this systematically across ecosystem scales from 200 to 200,000 skills.

Flat skill libraries don’t scale. When the agent scans a flat directory of, say, 80+ skills on every request, retrieval becomes unreliable. Two skills with similar descriptions start triggering interchangeably and behavior becomes nondeterministic for the same input. At the extreme, the orchestrator falls into routing collapse, where it consistently invokes the wrong skill because the semantic embeddings of two similar skills are indistinguishable. The output looks reasonable, but the wrong skill ran.
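A toy example makes the collision concrete. The sketch below scores every description in one pass and returns all skills that land within a small margin of the best score. It uses word-overlap (Jaccard) similarity as a stand-in for the semantic embeddings a real orchestrator would use, and the skill names, descriptions, and `margin` value are hypothetical.

```python
def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two strings, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def flat_candidates(query: str, skills: dict[str, str], margin: float = 0.05) -> list[str]:
    """Scan every description and return skills within margin of the top score."""
    scores = {name: jaccard(query, desc) for name, desc in skills.items()}
    best = max(scores.values())
    return sorted(name for name, score in scores.items() if best - score <= margin)


# Hypothetical flat library with two similar-sounding skills.
SKILLS = {
    "style-review": "Review pull requests for style violations",
    "security-review": "Review pull requests for security issues",
    "deploy-checklist": "Run the predeploy checklist before a release",
}

print(flat_candidates("review pull requests", SKILLS))
```

Both review skills tie for the top score on this query, so which one runs depends on tie-breaking rather than on the request itself.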

The fix the paper proposes is capability trees: organize skills into a hierarchy rather than a flat list. Top-level domains such as code, data, and docs sit at the root, with more specific skills as branches and leaves. The agent navigates from domain to branch to leaf instead of scanning everything. The authors also introduce a usage frequency queue, where skills that aren’t being invoked or aren’t improving outcomes get moved to a dormant index so they don’t pollute retrieval for active skills.
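The two ideas can be sketched together as a two-level tree with a dormant set. This is a simplified illustration, not the paper's implementation: the tree contents, the keyword-overlap scoring, the `route` and `dormant_skills` functions, and the `min_uses` threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical two-level capability tree: domain -> skills.
CAPABILITY_TREE = {
    "code": {
        "description": "source code pull requests reviews",
        "skills": {
            "style-review": "review pull requests for style violations",
            "security-review": "review pull requests for security issues",
        },
    },
    "docs": {
        "description": "documentation pages release notes",
        "skills": {"release-notes": "draft release notes from merged changes"},
    },
}


def overlap(query: str, text: str) -> int:
    """Count shared words; a stand-in for a real relevance model."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def route(query: str, tree: dict, dormant: frozenset = frozenset()):
    """Pick a domain first, then compare only that domain's active skills."""
    domain = max(tree, key=lambda d: overlap(query, tree[d]["description"]))
    active = {n: desc for n, desc in tree[domain]["skills"].items() if n not in dormant}
    if not active:
        return domain, None
    return domain, max(active, key=lambda n: overlap(query, active[n]))


def dormant_skills(tree: dict, usage: Counter, min_uses: int = 1) -> frozenset:
    """Skills invoked fewer than min_uses times move to the dormant index."""
    return frozenset(
        name for node in tree.values() for name in node["skills"] if usage[name] < min_uses
    )


usage = Counter({"security-review": 3, "release-notes": 1})  # toy invocation counts
dormant = dormant_skills(CAPABILITY_TREE, usage)
print(route("review pull requests for security issues", CAPABILITY_TREE))
print(route("review pull requests", CAPABILITY_TREE, dormant))
```

Narrowing to a domain first means each decision compares a handful of descriptions instead of the whole library, and the dormant set removes unused near-duplicates from that comparison.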

Testing this across ecosystems ranging from 200 to over 200,000 skills, the structured approach consistently outperformed flat

[truncated for AI cost control]