This is a written replay of the talk I gave at DevOpsCon Amsterdam on April 23, 2026. The full deck is embedded above: use arrow keys to step through, press o for the overview grid, or open it in a new tab. The article below adapts the slide notes into prose, in roughly the same order, so you can read it standalone or follow along with the slides.
The thesis in one sentence: AI agents fail inside real infrastructure not because the model is bad, but because the model has never seen your system. Closing that gap is a discipline. It has a name now: context engineering.
The gap between LLMs and production
LLMs are excellent at general reasoning. Your infrastructure is specific. Its services were named in 2021 by an engineer who has since left, the metric you actually need is called something nobody documented, and the dependency that brought down checkout last quarter exists in zero diagrams.
When a generic agent meets a real system, the failure mode is not hallucination. It is plausible-but-wrong. A confidently-named metric that does not exist. A "service" that is actually four resources spread across EC2, RDS, and Kubernetes. A root cause attributed to the wrong commit because the agent could not see the dependency that links them. Plausible-but-wrong is worse than wrong: you do not catch it.
The talk walks through three concrete failure modes that compound the moment an agent tries to investigate a production incident.
Failure mode 1: naming chaos
One service, four names. payments-api-prod-eu-west-1-v2 is an EC2 instance. pay_api_prod_euw1 is the RDS database behind it. payments-prod is a Kubernetes namespace. checkout-svc is the Kubernetes service that fronts both. On-call knows these are the same thing. The agent does not, unless we teach it.
This matters because the rest of the talk uses one running example: someone reports "5xx errors on payments." That sentence maps to all four resources at once. The word payments is not a key the agent can look up.
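The minimum viable fix is an explicit mapping from the human word to the concrete resources, which in production lives as edges in a graph rather than a hand-written table. A sketch, using the four identifiers from the slide; the structure itself is hypothetical:

```python
# Hypothetical alias table: one logical service, four physical names.
# In practice this lives as graph edges, not a hand-maintained dict.
SERVICE_ALIASES = {
    "payments": {
        "ec2_instance":  "payments-api-prod-eu-west-1-v2",
        "rds_database":  "pay_api_prod_euw1",
        "k8s_namespace": "payments-prod",
        "k8s_service":   "checkout-svc",
    },
}

def resolve(term: str) -> dict:
    """Turn a word from an incident report into concrete resources, or nothing."""
    return SERVICE_ALIASES.get(term.strip().lower(), {})

print(resolve("payments"))  # all four resources; the bare word matches none of their names
```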
Failure mode 2: telemetry drift
Three teams, three telemetry conventions. The platform team exposes a Prometheus counter, http_requests_total{status=~"5.."}. The payments team has a custom counter named payments_failures_nb. Checkout uses StatsD with a metric called checkout.request.failed.
A generic agent searches for "errors" across this estate. Zero matches. Any company past around fifty engineers has this problem. The signal that says "payments is failing" lives in payments_failures_nb, and the agent has to learn that mapping. It cannot guess.
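The mapping is three lines once someone writes it down, and unguessable until then. The dictionary shape below is illustrative; the metric names are the ones from the slide:

```python
# Where each team's failure signal actually lives. None of these contain the
# word "errors", which is why a naive keyword search across the estate finds nothing.
ERROR_SIGNALS = {
    "platform": 'http_requests_total{status=~"5.."}',  # Prometheus counter
    "payments": "payments_failures_nb",                 # custom counter
    "checkout": "checkout.request.failed",              # StatsD metric
}

print(ERROR_SIGNALS["payments"])  # the signal that says "payments is failing"
```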
Failure mode 3: invisible dependencies
The 5xx on payments is not actually about payments. The payments service depends on an auth cache. The auth cache started evicting hot keys after commit a7f3d2e tweaked its eviction policy fourteen minutes ago. Auth lookups now miss, the backend gets crushed, and payments returns 5xx.
Three hops, three teams, zero documentation. The payments SRE does not know the cache is hot-key-sensitive. The auth team does not know about a platform commit. The platform engineer who shipped a7f3d2e has no idea their change is paging on-call right now.
This is the chain we walk five times in the talk. Topology, not telemetry, is what unblocks investigations like this. Symptoms scattered across logs do not add up to a story. A graph does.
Layer 1: structural context
Context engineering, in this framing, is a discipline rather than a one-off configuration step. It splits cleanly into two layers: a structural layer that maps what exists, and a learned layer that captures what those things mean inside this specific organization.
The structural layer is a continuously-synced, time-aware graph of services, deployments, commits, cloud resources, Kubernetes objects, and Terraform state. The load-bearing word is versioned. A versioned graph lets you ask: what did the world look like at 22:01 UTC, six minutes before the incident? Which deployment changed between 21:55 and 22:01? Which resource did that deployment touch, and what depends on it?
Without topology, an investigation looks like: check logs, scale up, page the platform team, drift toward the wrong cause. With topology, it is one walk: symptom → dependency → commit → blast radius. There is no "author" node in the graph. The goal is to find the broken change, not to blame a human.
This is why the same incident produces a different shape with and without topology. Same input, different output. Telemetry tells you what is happening. Topology tells you where it is happening, and what it is connected to.
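As a sketch of what that walk looks like, here is a toy in-memory stand-in for the versioned graph. The edges, the commit, and the fourteen-minute gap are from the running example; the timestamps and data structures are illustrative, not how a real time-aware store is queried:

```python
from datetime import datetime, timedelta, timezone

# Toy stand-in for the versioned graph, pinned to one point in time.
DEPENDS_ON = {                      # service -> what it depends on
    "payments":   ["auth-cache"],
    "auth-cache": ["redis"],
    "checkout":   ["payments"],
}
RECENT_COMMITS = {                  # resource -> (sha, when it shipped)
    "auth-cache": [("a7f3d2e", datetime(2026, 4, 22, 21, 53, tzinfo=timezone.utc))],
}

def walk(symptom: str, incident_start: datetime, window=timedelta(minutes=30)):
    """symptom -> dependency -> commit -> blast radius, as one traversal."""
    seen, frontier = set(), [symptom]
    while frontier:                                   # transitive dependencies
        service = frontier.pop()
        for dep in DEPENDS_ON.get(service, []):
            if dep in seen:
                continue
            seen.add(dep)
            frontier.append(dep)
            for sha, shipped in RECENT_COMMITS.get(dep, []):
                if incident_start - window <= shipped <= incident_start:
                    # direct dependents only; a real blast radius is transitive
                    blast = [s for s, deps in DEPENDS_ON.items() if dep in deps]
                    yield {"symptom": symptom, "dependency": dep,
                           "commit": sha, "blast_radius": blast}

print(list(walk("payments", datetime(2026, 4, 22, 22, 7, tzinfo=timezone.utc))))
```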
Layer 2: learned context (ACE)
The graph tells you that payments depends on a cache that depends on Redis. It does not tell you that at 3am, when Redis looks unhappy, you should check the connection-pool configuration before chasing CPU. That runbook lives in heads, in wikis, in old Slack threads. Nodes are not enough. The agent needs to know what they mean in this org.
That is the second layer: ACE, Agentic Context Engineering. A method for the agent to improve on this specific infrastructure, without labels.
The "without labels" part is load-bearing. Most ML approaches assume you have labeled outcomes. Infrastructure does not. Post-mortems are sparse. "Resolved" in PagerDuty does not mean "root cause found." Production does not pause to annotate itself. So ACE extracts signal from the process rather than the outcome. Each investigation produces a trace: what the agent queried, what it found, which paths it walked, what it concluded. Patterns that hold across many traces get promoted into reusable context.
What gets learned, concretely, falls into three buckets:
- Naming patterns: which conventions this org uses. `*-prod-*` means production. The numeric suffix on `pay_api_prod_euw1` means region.
- Telemetry mappings: where each team's "errors" actually live. Payments errors are in `payments_failures_nb`, not in any metric containing the word "errors."
- Investigation strategies: when Redis looks unhappy at 3am, check pool config before CPU. Eighty percent of the time, that is where the cause lives.
The feedback architecture is a loop. The graph grounds each agent run. Each run produces a trace. A reasoner extracts reusable fragments. Fragments become learned context, fed into the next run. The agent gets better on *this* infrastructure over time. No retraining, no model updates. The same model with progressively richer context.
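Spelled out as code, with every helper reduced to a trivial stand-in for the real component, the loop is just this:

```python
# Sketch of the feedback loop; each helper is a stand-in for a real component
# (graph sync, agent run, reasoner). Only the loop shape is the point.
learned_context: list[dict] = []                 # fragments promoted from earlier runs

def ground(incident: str) -> dict:
    return {"incident": incident, "nodes": ["payments", "auth-cache", "redis"]}

def run_agent(incident: str, snapshot: dict, context: list[dict]) -> dict:
    # Real version: the agent traverses the graph, guided by learned context.
    return {"paths": [["payments", "auth-cache", "commit a7f3d2e"]],
            "conclusion": "revert a7f3d2e"}

def extract_fragments(trace: dict) -> list[dict]:
    # Real version: a reasoner keeps only what holds across many traces.
    return [{"pattern": "for 5xx on payments, check recent commits on dependencies"}]

def investigate(incident: str) -> str:
    snapshot = ground(incident)                          # the graph grounds the run
    trace = run_agent(incident, snapshot, learned_context)
    learned_context.extend(extract_fragments(trace))     # richer context next time
    return trace["conclusion"]

print(investigate("5xx errors on payments"))
```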
Annie investigates: the scripted replay
In the talk, this is where I run a deterministic replay against Annie, the agent we build at Anyshift. On-call asks, in plain language, what is happening with payments. Annie grounds against the graph (no hallucination, just traversal), applies a learned pattern ("for 5xx on payments, recent commits on dependencies have explained 14 of 23 prior incidents"), and returns four lines: symptom, dependency, commit, blast radius. About thirty seconds, against the historical twenty minutes with multiple people on a war-room bridge.
The author of a7f3d2e turns out to be me, fourteen minutes ago. (Reliably gets a laugh in a room full of platform engineers.) Annie offers to revert, draws a flowchart of the dependency chain, computes the blast-radius drop from twelve services to zero, and opens the PR with the right reviewer attached. When asked to merge it, Annie refuses. Guardrail in the wild: no silent merges, ever. After resolution, the investigation gets promoted to patterns.yaml. The demo itself becomes tomorrow's pattern.
Promoting patterns: the safety story
Every investigation produces a draft pattern. Drafts are not live and not trusted yet. To survive into the live set, a draft has to dwell for roughly seven days and match across at least five independent traces. Mis-fires get dropped. Patterns earn their way in.
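The gate itself is small. A sketch, with the thresholds from the slide and an assumed shape for the draft record:

```python
from datetime import datetime, timedelta, timezone

DWELL = timedelta(days=7)      # a draft must survive roughly a week
MIN_TRACES = 5                 # and match across at least five independent traces

def ready_for_review(draft: dict, now: datetime) -> bool:
    """Drafts earn their way in; this gate only says a draft is ready for human review."""
    old_enough = now - draft["created_at"] >= DWELL
    corroborated = len(set(draft["matched_trace_ids"])) >= MIN_TRACES
    return old_enough and corroborated and not draft.get("misfired", False)

draft = {  # illustrative draft pattern
    "scope": "payments",
    "pattern": "for 5xx on payments, check recent commits on dependencies first",
    "created_at": datetime(2026, 4, 14, tzinfo=timezone.utc),
    "matched_trace_ids": ["t1", "t2", "t3", "t4", "t5"],
}
print(ready_for_review(draft, datetime(2026, 4, 23, tzinfo=timezone.utc)))  # True
```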
Live patterns are reviewed like code. Diffable, revertible, scoped to their source. Payments-team patterns never silently apply to platform-team services. Every live pattern is one a human approved.
That is what makes "learning system running in production" safe to say in a room of platform engineers. Three guardrails make it concrete:
1. Scope by source. A pattern learned from payments traces stays in the payments scope unless explicitly broadened.
2. Version like code. Patterns are diffable, reviewable, revertible. No opaque weights, no silent updates.
3. Fail loud. When the agent cannot ground a claim, it says so. Plausible-but-wrong becomes impossible by construction. This is what closes the loop on the failure thesis from the start of the talk.
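Guardrail 3 is the easiest one to show in code. A minimal sketch, assuming a grounding step that checks every resource the agent wants to talk about against the graph snapshot:

```python
class UngroundedClaim(Exception):
    """Raised instead of guessing when a claim cannot be tied to a graph node."""

def ground_claim(resource: str, snapshot_nodes: set[str]) -> str:
    # Fail loud: a resource the graph has never seen is not something
    # the agent is allowed to reason about as if it existed.
    if resource not in snapshot_nodes:
        raise UngroundedClaim(f"no graph node for {resource!r}; refusing to speculate")
    return resource

nodes = {"payments", "auth-cache", "redis"}
ground_claim("auth-cache", nodes)       # fine
# ground_claim("billing-cache", nodes)  # raises, loudly, instead of sounding plausible
```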
Limits of the approach
The honest list. Four things this does not solve:
- Novel shapes. A failure that has never been seen before has no prior trace, so no pattern. The first occurrence takes the full investigation walk.
- Pattern decay. Infrastructure churns faster than patterns get promoted. Old rules stop matching. Pruning is real work.
- Not a replacement for observability. With weak signals coming in, ACE produces confidently-wrong answers faster. The substrate has to be there.
- New-team lag. A freshly-formed team's patterns take weeks to surface. Onboarding does not get the shortcut, at least not at first.
These are genuine limits of the approach, not a setup for "but the product solves them." Naming your own weaknesses before Q&A is the strongest thing a speaker can do.
Takeaway
The goal is not a smarter model. It is an agent that knows your system.
That is the whole talk in one line. The graph gives you the substrate. ACE gives you the layer that makes the substrate meaningful in this specific organization. Together they turn an investigation that used to take twenty minutes and three people into one that takes thirty seconds, one Slack thread, and a single PR.
If you want to see the graph and the agent in action against your own infrastructure, the setup is around thirty minutes, read-only IAM, and no agents in production. We are at the Anyshift booth this week, or you can find me after the talk.
