4 min read · 2026

The 10-Hour Skill Beats the 10-Minute Skill Every Time

I thought a single SKILL.md file was enough. Then I saw how Anthropic's own team structures theirs, and rebuilt everything.

I thought writing a Skill meant dropping a SKILL.md file into a folder and moving on. Ten minutes, done. That worked fine until I watched the same mistakes repeat across every invocation and realized I had no way to tell whether the Skill was actually doing what I intended.

Then Thariq, one of the engineers building Claude Code at Anthropic, posted something that reframed the whole thing: “Using Skills well is a skill issue.”

That line stuck because it matched what I was seeing. The gap between a quick markdown file and a properly structured Skill folder was showing up in real output quality, not just in theory.

A Skill is a folder, not a file

The most common misconception is that a Skill equals one SKILL.md file. In practice, a Skill is a folder containing scripts, reference code, configuration, and the markdown file that ties them together.

Anthropic’s internal approach uses what they call progressive disclosure. Instead of cramming everything into a single prompt, they arrange files so Claude reads only what it needs at the moment it needs it. A references/api.md file holds function signatures that Claude pulls in on demand. An assets/ directory contains output templates so the prompt never has to describe formatting. Validation scripts let Claude test its own output before returning it.
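A minimal layout following that pattern might look like this (the file and directory names here are illustrative, not a required convention):

```
my-skill/
├── SKILL.md          # metadata + instructions; the only file read up front
├── references/
│   └── api.md        # function signatures, pulled in on demand
├── assets/
│   └── template.md   # output template, so the prompt never describes formatting
└── scripts/
    └── validate.py   # lets Claude test its own output before returning it
```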

If you open the skill-creator repo, you’ll see this principle in action. The agents/, references/, and scripts/ directories sit alongside the SKILL.md. The tool that builds Skills is itself built as one.

Gotchas matter more than the prompt body

Thariq called the Gotchas section the “highest-signal content” in a Skill. Not the main instructions, not the examples. The Gotchas.

This tracks with my experience. I built a Skill without a Gotchas section and hit the same error three times in a row. The moment I added a line documenting that specific failure pattern, it stopped happening.

The reasoning is straightforward. Claude already knows most of what you’d write in the prompt body. Telling it how to write TypeScript or format JSON is restating things it handles well by default. But telling it what not to do in your specific context is genuinely new information.

A few principles from Thariq’s post that I’ve found reliable: don’t state the obvious because redundant instructions can actually degrade performance; avoid railroading with overly specific steps because it kills Claude’s ability to adapt; and remember that the description field isn’t documentation for humans, it’s the input Claude uses to decide when to trigger the Skill.
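To make that concrete, here is a sketch of how those pieces might sit together in a SKILL.md. The frontmatter fields follow the Skill format's name/description convention; the Gotchas entries are invented examples of the "what not to do in your specific context" kind of line:

```markdown
---
name: report-formatter
description: Formats quarterly reports. Use when the user asks to format,
  clean up, or standardize a report document.
---

## Instructions
Convert the draft into the template in assets/template.md.

## Gotchas
- Do NOT re-wrap tables; the downstream parser expects single-line rows.
- Keep dates in ISO 8601 even when the draft uses US format.
```

Note that the description reads like a trigger condition, not documentation, and the Gotchas encode failures rather than restating defaults.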

Skill Creator turns “seems to work” into “verified”

The Skill Creator update from two weeks ago changed how I think about Skill quality. You define test prompts, set expected outcomes, and the tool verifies whether the Skill actually produces correct results. It’s unit testing for prompts.
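The "unit testing for prompts" idea can be sketched in a few lines of Python. This is not the Skill Creator's actual interface, just the shape of the loop: each case pairs a test prompt with a predicate over the output, and `run_skill` is a stand-in for invoking the model with the Skill loaded.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    prompt: str
    check: Callable[[str], bool]  # expected-outcome predicate

def run_skill(prompt: str) -> str:
    # Placeholder: in practice this would call the model with the Skill
    # attached. Here it just simulates a formatting Skill's output.
    return f"formatted: {prompt.lower()}"

def run_evals(cases: list[EvalCase]) -> dict[str, bool]:
    # Run each test prompt and record pass/fail, like a unit-test suite.
    return {c.name: c.check(run_skill(c.prompt)) for c in cases}

cases = [
    EvalCase("adds prefix", "Format Report",
             lambda out: out.startswith("formatted:")),
    EvalCase("lowercases text", "Format Report",
             lambda out: "report" in out),
]
```

Calling `run_evals(cases)` returns a pass/fail map per case; a failing entry tells you exactly which expectation the Skill no longer meets.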

I added evals to a Skill I’d been using for weeks. Two test cases I assumed would pass failed immediately. The fixes were small, but the output quality shifted noticeably once I applied them.

There’s a useful distinction between two types of Skills. Capability uplift Skills teach Claude something it can’t do well on its own. Encoded preference Skills enforce a team’s specific workflow or standards. The first type has a natural expiration date because model improvements eventually make it unnecessary. The second type stays valuable as long as the workflow exists. Evals help you catch the moment a capability uplift Skill becomes dead weight.

The tooling supports benchmark mode for tracking pass rates and token usage across model updates, multi-agent parallel execution to avoid context contamination during testing, and a comparator agent that runs blind A/B comparisons of output with and without the Skill applied.

The compound return

Across the hundreds of Skills I’ve seen and the dozens I maintain, one pattern holds: the value of a Skill comes from iteration, not from the initial draft.

The folder structure is how you shape Claude’s context window. Gotchas convert your failures into reusable knowledge. Evals measure whether that knowledge still holds.

Writing a SKILL.md takes ten minutes. Adding Gotchas from real failures, building eval cases, and including validation scripts takes closer to ten hours. That investment pays back every time the Skill runs. Set one up tonight. By morning, it will have done work you didn't have to do.
