An AI agent in the incident channel can already do the mechanical parts of a page. It greps the logs, runs kubectl get pods, pulls the Grafana panel. What it can't do is the part an on-call learns the hard way, paged at 3am: which signal to read first, whether the deploy from twenty minutes ago is the cause or a coincidence, and when to stop digging and wake a human.

That judgment is the actual job, and it doesn't fit in a system prompt.

So we started writing it down as skills an agent can load. sre-skills is an open library of methodology-shaped SRE skills, Apache-2.0 and vendor-neutral. Five of them: investigate a live incident, analyze change impact before a risky apply, hand over on-call, write a postmortem, audit production readiness. None of them needs a vendor account, credentials, or even our product. Each one is built to run end to end against fixtures checked into the repo.

Here's the honest part: only one of the five is finished so far. The other four are scoped, with the structure in place and the methodology still to fill in. The one that's done is incident-investigator, and we wrote eleven worked incident scenarios for it before we trusted it to ship as the reference template the others copy.

Methodology-shaped means the skill isn't a list of kubectl commands. It's the decision procedure. Correlate the recent deploys before you stare at the metrics. Treat the most recent change as the prime suspect until the evidence clears it. Work through the failure modes a real investigation rules out, and know the point where you stop and page someone instead of guessing. The commands are the easy part. The order you run them in, and what you read from each output, is the part that took years to learn.
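To make "decision procedure" concrete, here is a minimal sketch of the prime-suspect rule in Python. It is not code from sre-skills: the `Deploy` type, the `cleared` flag, the one-hour lookback window, and the choice to escalate once every recent change has been cleared are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deploy:
    # Hypothetical record type; the real fixtures may be shaped differently.
    service: str
    deployed_at: datetime
    cleared: bool = False  # set to True once evidence rules this change out


def prime_suspect(
    deploys: List[Deploy],
    incident_start: datetime,
    window: timedelta = timedelta(hours=1),  # assumed lookback, not from the repo
) -> Optional[Deploy]:
    """Return the most recent uncleared deploy inside the lookback window."""
    candidates = [
        d
        for d in deploys
        if not d.cleared
        and incident_start - window <= d.deployed_at <= incident_start
    ]
    return max(candidates, key=lambda d: d.deployed_at, default=None)


def next_step(deploys: List[Deploy], incident_start: datetime) -> str:
    """Pick the next action: chase the prime suspect, or stop and page."""
    suspect = prime_suspect(deploys, incident_start)
    if suspect is None:
        # Assumed stopping rule: no recent change left to suspect.
        return "escalate: no recent change left to suspect, page a human"
    return f"investigate deploy of {suspect.service} at {suspect.deployed_at:%H:%M}"
```

The point of the sketch is the ordering: changes are examined newest first, and a suspect is dropped only when something explicitly clears it, never because a metric looked more interesting.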

Run it without touching production

Clone the repo, point an agent at incident-investigator, and it investigates against the fixtures in the repo: a set of deploys, logs, metrics, traces, and pod events that describe one failing service. No credentials, no setup, and the run is fully reproducible. You read exactly how the agent reasons through the incident before you ever trust it near your own stack. Then you fork the template and swap the fixtures for your own runbooks.

Where to get it

It installs as a Claude Code plugin:

```
/plugin marketplace add anyshift-io/claude-plugins
/plugin install sre-skills@anyshift
```

For any other agent, it's a plain git clone of `anyshift-io/sre-skills`. The skills are Markdown with fixtures next to them, so anything that can read files can load them.
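As a rough illustration of "anything that can read files can load them", the sketch below walks a local clone and pairs each Markdown file with the other files in its directory. The directory layout and any file names are assumptions; check the repo for the actual structure. Note that it picks up every Markdown file, including any README.

```python
from pathlib import Path


def discover_skills(repo_root: str) -> dict[str, list[str]]:
    """Map each Markdown file under repo_root to the other files beside it."""
    root = Path(repo_root)
    found: dict[str, list[str]] = {}
    for md in sorted(root.rglob("*.md")):
        # Treat every sibling file as a candidate fixture (an assumption).
        siblings = sorted(
            p.name for p in md.parent.iterdir() if p.is_file() and p != md
        )
        found[str(md.relative_to(root))] = siblings
    return found


# Example use, assuming the clone lives in ./sre-skills:
#   for skill, fixtures in discover_skills("sre-skills").items():
#       print(skill, fixtures)
```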

What's still hard is the methodology, not the plumbing. Writing the four remaining skills means pinning down the judgment a good on-call makes without thinking, in enough detail that an agent can follow it and you can audit where it went wrong. That's the work. The repo is open for forks and scenarios: if you've got an incident shape the investigator should handle and doesn't, that's the contribution we want.