How we turned on-call judgment into skills an AI agent can load

An AI agent in the incident channel can already grep the logs and run `kubectl get pods`. It can read the error rate straight off a Grafana panel. Where it falls down is the judgment part: deciding whether the deploy from twenty minutes ago is the actual cause or a coincidence, and knowing when the dig has gone far enough to wake someone up.

That judgment is the actual job, and it doesn't fit in a system prompt.

So we started writing it down as skills an agent can load. sre-skills is an open library of methodology-shaped SRE skills, Apache-2.0 and vendor-neutral. Five of them:

investigate a live incident
weigh change impact before a risky apply
hand over on-call
write a postmortem
audit production readiness

You don't need a vendor account or our product to run any of them, and each one runs end to end against fixtures checked into the repo.

Here's the honest part: only one of the five is actually finished. The other four are scoped, with the structure in place and the methodology still to fill in. The one that's done is incident-investigator, and we wrote eleven worked incident scenarios for it before we trusted it to ship as the reference template the others copy.

Methodology-shaped means the skill carries the order of operations, not just the commands. The retry-storm fixture is a good tell. payments-api is throwing errors, and the most recent deploy is an invoice-formatting refactor whose own commit message swears "no behavior change to ledger client". A naive agent goes straight to the latency graph. The skill goes to the deploy history first and holds that innocent-looking refactor as the prime suspect until something clears it, since a change that landed right before the errors earns suspicion a flat graph never will. The same logic governs the exits: which failure modes would let the deploy off the hook, and when to quit guessing and page a human. That ordering is most of the work.
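To make that ordering concrete, here is a minimal Python sketch of the deploy-first rule. It is an illustration, not code from the repo: the `Deploy` record, both function names, the 30-minute suspicion window, and the 20-minute escalation budget are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Deploy:
    # Hypothetical record shape; the real fixtures may look different.
    deploy_id: str
    deployed_at: datetime
    message: str


def prime_suspects(deploys, error_onset, window=timedelta(minutes=30)):
    """Deploys that landed within `window` before the errors began, newest first."""
    recent = [
        d for d in deploys
        if timedelta(0) <= error_onset - d.deployed_at <= window
    ]
    return sorted(recent, key=lambda d: d.deployed_at, reverse=True)


def next_step(suspects, cleared_ids, minutes_spent, budget_minutes=20):
    """Pick the next move: escalate, chase the newest uncleared deploy, or widen out."""
    # Assumed exit rule: past the time budget, stop guessing and page.
    if minutes_spent >= budget_minutes:
        return "page a human"
    # A deploy stays a suspect until something explicitly clears it.
    for deploy in suspects:
        if deploy.deploy_id not in cleared_ids:
            return f"investigate deploy {deploy.deploy_id}: {deploy.message}"
    return "look beyond recent deploys"


onset = datetime(2024, 1, 1, 12, 0)
deploys = [
    Deploy("d1", datetime(2024, 1, 1, 9, 0), "bump logging library"),
    Deploy("d2", datetime(2024, 1, 1, 11, 40), "refactor invoice formatting"),
]
suspects = prime_suspects(deploys, onset)
print(next_step(suspects, cleared_ids=set(), minutes_spent=5))
# prints "investigate deploy d2: refactor invoice formatting"
```

The point of the sketch is the order of the checks: the commit message never enters the decision, so a refactor that claims to change nothing is still investigated first simply because of when it landed.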

Run it without touching production

You don't need a production system to see it work. The repo ships a snapshot of one failing service, the deploys and logs and metrics it had during a real incident, and incident-investigator works that snapshot the way it would a live page. No setup, no credentials, so you can read its whole chain of reasoning before you trust it anywhere near your own stack. When it's useful, fork the template and drop your own runbooks in where ours are.

Where to get it

It installs as a Claude Code plugin:

/plugin marketplace add anyshift-io/claude-plugins
/plugin install sre-skills@anyshift

For any other agent, it's a plain git clone of `anyshift-io/sre-skills`. The skills are Markdown with fixtures next to them, so anything that can read files can load them.
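As a rough sketch of what "anything that can read files" can mean in practice, the Python below walks a local clone and collects every Markdown file as text an agent could be handed. The function name and the idea of keying by repo-relative path are assumptions for illustration; the repo's actual directory layout is not described here, so the sketch does not depend on it.

```python
from pathlib import Path


def load_skills(repo_root):
    """Map each Markdown file's repo-relative path to its contents."""
    root = Path(repo_root)
    return {
        # as_posix() keeps the keys stable across operating systems.
        path.relative_to(root).as_posix(): path.read_text(encoding="utf-8")
        for path in sorted(root.rglob("*.md"))
    }
```

Calling `load_skills("sre-skills")` on a clone in the current directory would return a dictionary of file paths to Markdown text, which can then be passed to whatever agent you use as context.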

The hard part of the four remaining skills isn't the plumbing. It's pinning down the judgment a good on-call makes without thinking, in enough detail that an agent can follow it and you can audit where it went wrong. The repo is open for forks and scenarios. If you've got an incident shape the investigator should handle and doesn't, that's the contribution we want!

Louis Fradin
