I gave this talk at DevOpsCon Amsterdam on April 23, 2026. What's below is the prose version of the slide notes. The actual deck is the embedded thing above (arrow keys to step through, o for the grid view). Read this on its own, or follow the slides side by side, your call.

Here's the one-liner I opened with: AI agents fail inside real infrastructure for one reason, which is that the model has never seen your system. The model itself is competent. Closing the gap between a competent generalist and a useful on-call teammate is what this talk calls context engineering.

The gap between LLMs and production

Two clusters of nodes side by side. On the left, four cyan nodes labelled frontend, api, cache, database connected by solid edges, captioned "what the LLM knows." On the right, six unlabelled nodes with dashed terracotta rings, each containing a question mark, captioned "your infrastructure." A few dashed edges bridge the gap between the two clusters.

LLMs are excellent at general reasoning. Your infrastructure is specific. It was named in 2021 by an engineer who has since left, the metric you actually need is called something nobody documented, and the dependency that brought down checkout last quarter exists in zero diagrams.

The talk walks through three concrete failure modes that compound the moment an agent tries to investigate a production incident.

Failure mode 1: naming chaos

Four name boxes converge with dashed terracotta arrows on a single cyan node labelled "one service." The four names are payments-api-prod-eu-west-1-v2 (EC2 instance), pay_api_prod_euw1 (RDS instance), payments-prod (Kubernetes namespace), checkout-svc (Kubernetes service).

One service, four names: payments-api-prod-eu-west-1-v2 (the EC2 instance), pay_api_prod_euw1 (the RDS database behind it), payments-prod (the Kubernetes namespace), and checkout-svc (the service fronting both). On-call knows those are the same thing, but the agent does not unless we teach it.

This matters because the rest of the talk uses one running example: someone reports "5xx errors on payments." That sentence maps to all four resources at once. The word payments is not a key the agent can look up.
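To make the lookup problem concrete, here is a minimal sketch of an alias table in Python. Only the four resource names come from the example above; the canonical id `payments`, the `ALIASES` table, and the `resources_for` helper are hypothetical names for illustration, not how any particular product stores this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    name: str
    kind: str


# Hypothetical alias table: canonical service id -> every name it goes by.
ALIASES = {
    "payments": [
        Resource("payments-api-prod-eu-west-1-v2", "EC2 instance"),
        Resource("pay_api_prod_euw1", "RDS instance"),
        Resource("payments-prod", "Kubernetes namespace"),
        Resource("checkout-svc", "Kubernetes service"),
    ],
}


def resources_for(term):
    """Resolve a human term, or any known resource name, to the full resource set."""
    term = term.strip().lower()
    for service, resources in ALIASES.items():
        if term == service or any(term == r.name.lower() for r in resources):
            return resources
    return []  # unknown term: the agent has nothing to look up
```

With a table like this, "payments" and `checkout-svc` resolve to the same four resources, which is the mapping on-call carries in their head.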

Failure mode 2: telemetry drift

An agent searches the term "errors" and gets three terracotta cross marks against three differently-named telemetry metrics: http_requests_total{status=~"5.."} (platform), payments_failures_nb (payments, the one we needed), and checkout.request.failed (checkout).

Three teams, three telemetry conventions. The platform team exposes a Prometheus counter (http_requests_total{status=~"5.."}), while the payments team has rolled their own counter named payments_failures_nb, and checkout is on StatsD with a metric called checkout.request.failed.

A generic agent searches for "errors" across this estate. Zero matches. Any company past around fifty engineers has this problem. The signal that says "payments is failing" lives in payments_failures_nb, and the agent has to learn that mapping. It cannot guess.
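A small sketch of why the keyword search comes back empty and what the learned mapping might look like. The metric names are the three from the example; the `ERROR_SIGNALS` registry and both helper functions are assumptions made for illustration.

```python
# Hypothetical registry: team -> the metric that carries its error signal.
ERROR_SIGNALS = {
    "platform": 'http_requests_total{status=~"5.."}',  # Prometheus counter
    "payments": "payments_failures_nb",                # home-grown counter
    "checkout": "checkout.request.failed",             # StatsD metric
}


def naive_search(term):
    """What a generic agent does: substring match on metric names."""
    return [m for m in ERROR_SIGNALS.values() if term in m]


def error_metric(team):
    """What a taught agent does: look the signal up by team."""
    return ERROR_SIGNALS[team]
```

None of the three names contains the string "errors", so `naive_search("errors")` returns an empty list, while `error_metric("payments")` returns the metric that was actually needed.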

Failure mode 3: invisible dependencies

A three-step chain across teams. On the left, a cyan node labelled "5xx spike payments" under the payments team caption (symptom). In the middle, a dashed terracotta-ringed node labelled "cache" under the auth team caption (undocumented dependency), connected to the left by a dashed eviction line and a solid arrow back labelled "cache hits." On the right, a solid terracotta-filled node containing "commit a7f3d2e" under the platform team caption (root cause, fourteen minutes ago).

The 5xx on payments is not actually about payments. The payments service depends on an auth cache. The auth cache started evicting hot keys after commit a7f3d2e tweaked its eviction policy fourteen minutes ago. Auth lookups now miss, the backend gets crushed, and payments returns 5xx.

Three hops, three teams, zero documentation across the chain. The payments SRE getting paged has never been told the cache is hot-key-sensitive (that detail lives in the head of someone on the auth team, not in the wiki). Meanwhile the platform engineer who shipped a7f3d2e is in a different Slack channel entirely, unaware that their change has just paged on-call.

This is the chain we walk five times in the talk. Topology, not telemetry, is what unblocks an investigation like this. Symptoms scattered across logs do not add up to a story. A graph does.
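The walk from symptom to root cause can be sketched as a breadth-first traversal over dependency edges. This is an illustration under assumptions: the service name `auth-cache`, the `DEPENDS_ON` and `RECENT_CHANGES` tables, and the `suspects` function are hypothetical; only the commit hash and the fourteen-minute age come from the example.

```python
from collections import deque

# Hypothetical dependency edges: service -> services it depends on.
DEPENDS_ON = {
    "payments": ["auth-cache"],
    "auth-cache": [],
}

# Hypothetical change log: service -> [(commit, minutes since it landed)].
RECENT_CHANGES = {
    "auth-cache": [("a7f3d2e", 14)],
}


def suspects(symptom_service, window_minutes):
    """Walk dependencies outward from the symptom and collect recent changes.

    Returns (hops, service, commit) tuples, nearest dependency first.
    """
    seen = {symptom_service}
    queue = deque([(symptom_service, 0)])
    found = []
    while queue:
        service, hops = queue.popleft()
        for commit, age in RECENT_CHANGES.get(service, []):
            if age <= window_minutes:
                found.append((hops, service, commit))
        for dep in DEPENDS_ON.get(service, []):
            if dep not in seen:
                seen.add(dep)
                queue.append((dep, hops + 1))
    return found
```

Starting from `payments` with a thirty-minute window, the walk finds a7f3d2e one hop away on the cache. Without the `payments -> auth-cache` edge there is nothing to walk, which is the point of the section.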

Layer 1: structural context

A horizontal timeline with cyan tick marks for individual deploys, one tick highlighted in terracotta and labelled "commit a7f3d2e, cache eviction policy." A play head on the right is labelled "now, 22:07 UTC, payments 5xx." A dashed curved arrow from now points back to the highlighted commit, captioned: "what changed in the last six minutes?"

Context engineering, in the framing I'm using here, isn't a one-time setup. It's something you build and maintain. Two layers do the work. The first one maps what's actually there in your environment, with edges, time, and provenance. The second one captures what those nodes mean inside your specific org, which is knowledge you cannot import from another company's wiki.

The first layer is a graph synced from your environment, with history. Services, deploys, commits, cloud resources, Kubernetes objects, Terraform state. You can ask the graph what existed at 22:01 UTC, six minutes before the page fired. You can also ask which commit landed in that window and which resources it touched.
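A minimal sketch of the "what changed in this window" query, assuming the graph keeps a timestamped change history. The `Change` record, the `HISTORY` list, and `changes_in_window` are hypothetical, and every timestamp below (as well as the placeholder commit `0000001`) is illustrative rather than taken from the incident.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Change:
    commit: str
    at: datetime
    touched: tuple  # names of the resources the change modified


# Illustrative history; dates and times are assumptions for the sketch.
HISTORY = [
    Change("0000001", datetime(2026, 1, 1, 20, 30, tzinfo=timezone.utc), ("checkout-svc",)),
    Change("a7f3d2e", datetime(2026, 1, 1, 21, 53, tzinfo=timezone.utc), ("auth-cache",)),
]


def changes_in_window(history, end, window):
    """Return the changes that landed in the `window` leading up to `end`."""
    start = end - window
    return [c for c in history if start <= c.at <= end]


page_time = datetime(2026, 1, 1, 22, 7, tzinfo=timezone.utc)
recent = changes_in_window(HISTORY, page_time, timedelta(minutes=15))
```

Here `recent` holds the single change to the cache along with the resources it touched. The window width is a parameter, so the agent can widen it when the first pass comes back empty.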

The shape of the investigation depends on whether the agent has this graph at all. Telemetry tells the agent what is happening. Topology tells it where, and what else is wired to that where.

Layer 2: learned context (ACE)

A circular feedback loop with four nodes connected by clockwise dashed arrows. Top: "graph context" (cyan). Right: "agent run" (tan). Bottom: "investigation trace" (terracotta). Left: "reasoner extracts" (forest green). At the centre of the loop, the label "ACE, no labels, no retraining."

The second layer is ACE, Agentic Context Engineering: a method for the agent to learn this specific infrastructure without labels, by extracting signal from its own investigation traces rather than from outcomes nobody annotates.

Over time the learned layer builds up an org-specific cheat sheet. A recurring 3am Redis problem is one example: once the agent has run into it across five separate investigations, it gets written into the pattern set, so the next on-call doesn't figure it out from scratch.

The feedback architecture is a loop. The graph grounds each agent run, each run produces a trace, a reasoner extracts reusable fragments from the trace, and those fragments become learned context for the next run. The agent gets better on this specific infrastructure over time, with no retraining and no model updates: the same model with progressively richer context.
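The shape of that loop, reduced to a toy in Python. Both `agent_run` and `extract_fragments` are stand-ins I made up to show the data flow; a real agent and a real reasoner would be model calls, and nothing here reflects the actual ACE implementation.

```python
def agent_run(question, graph_context, learned_context):
    """Stand-in for an agent run: returns a trace of the steps it took."""
    trace = ["question: " + question]
    trace += ["graph: " + fact for fact in graph_context]
    trace += ["recalled: " + frag for frag in learned_context]
    return trace


def extract_fragments(trace):
    """Stand-in for the reasoner: keep graph facts as reusable fragments."""
    prefix = "graph: "
    return [step[len(prefix):] for step in trace if step.startswith(prefix)]


def run_loop(questions, graph_context):
    """Graph grounds each run, each run yields a trace, fragments feed the next run."""
    learned = []
    traces = []
    for question in questions:
        trace = agent_run(question, graph_context, learned)
        traces.append(trace)
        for fragment in extract_fragments(trace):
            if fragment not in learned:
                learned.append(fragment)
    return learned, traces
```

The model never changes inside `run_loop`. The only thing that grows between iterations is `learned`, which is why the second run's trace contains a recalled fragment and the first does not.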

Annie investigates: the scripted replay

On stage I switch to Annie. The question goes in: "what's happening with payments?" Annie traverses the graph (no invented metrics) and pulls a learned pattern. Four lines come back: symptom, dependency, commit, blast radius. Thirty seconds.

Then the punchline. The author of a7f3d2e is me, fourteen minutes ago. The room laughs. Annie offers to revert and opens the PR with the right reviewer attached. It will not merge the PR on its own.

Promoting patterns: the safety story

A horizontal pipeline showing how draft patterns become live patterns. From left to right: "draft pattern" (terracotta dashed box, "not trusted yet"), an arrow labelled "7 days dwell," "candidate pattern" ("earning its way in"), an arrow labelled "≥5 cross-trace matches," "human review" ("PR-style, diffable"), an arrow labelled "approved," and "live pattern" (cyan solid, "a human approved"). A side branch from the candidate stage shows "mis-fires dropped."

Every investigation produces a draft pattern. Drafts are not live and not trusted yet. To survive into the live set, a draft has to dwell for roughly seven days and match across at least five independent traces. Mis-fires get dropped. Patterns earn their way in.
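The promotion gates can be sketched as a small stage function. The thresholds come from the pipeline described above (the seven days are "roughly" seven in practice); the `Pattern` record, its field names, and the `stage` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DWELL_DAYS = 7         # approximate dwell time before a draft becomes a candidate
MIN_TRACE_MATCHES = 5  # independent traces a candidate must match


@dataclass
class Pattern:
    name: str
    age_days: int
    matched_traces: set                # ids of independent traces it matched
    misfired: bool = False
    approved_by: Optional[str] = None  # set only by a human reviewer


def stage(p):
    """Return where a pattern currently sits in the promotion pipeline."""
    if p.misfired:
        return "dropped"
    if p.age_days < DWELL_DAYS:
        return "draft"
    if len(p.matched_traces) < MIN_TRACE_MATCHES:
        return "candidate"
    if p.approved_by is None:
        return "human review"
    return "live"
```

The order of the checks is the safety property: no combination of age and match count reaches "live" unless `approved_by` is set.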

Live patterns are reviewed like code: diffable, revertible, scoped to their source. Payments-team patterns never silently apply to platform-team services. Every live pattern is one a human approved.
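Scoping can be illustrated with a simple ownership check. The `SERVICE_OWNER` table, the `ingress-gateway` service, the pattern names, and the `applicable` function are all invented for this sketch.

```python
# Hypothetical ownership table: service or namespace -> owning team.
SERVICE_OWNER = {
    "payments-prod": "payments",
    "ingress-gateway": "platform",
}

# Hypothetical pattern set; "scope" is the team whose traces produced the pattern.
PATTERNS = [
    {"name": "auth-cache-eviction", "scope": "payments", "status": "live"},
    {"name": "unreviewed-idea", "scope": "payments", "status": "draft"},
]


def applicable(patterns, service):
    """Only live patterns whose scope matches the service's owning team apply."""
    owner = SERVICE_OWNER.get(service)
    return [p for p in patterns if p["status"] == "live" and p["scope"] == owner]
```

A payments-scoped pattern applies to `payments-prod` and returns nothing for `ingress-gateway`. Widening a scope would be an explicit, reviewable change to the pattern itself.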

Limits of the approach

The honest list: four things this does not solve.

A failure that has never been seen before has no prior trace, so no pattern. The first occurrence of any failure always takes the full investigation walk.

Infrastructure churns faster than patterns get promoted, so old rules stop matching after a while. Pruning the pattern set is real work and there is no shortcut for it.

ACE is not a replacement for observability. With weak signals coming in, the agent produces confidently-wrong answers faster. The substrate has to be there.

A freshly-formed team's patterns take weeks to surface, so onboarding does not get the shortcut, at least not for the first month.

These are limits of the approach. None of them is followed by "but the product solves them." Naming your own weaknesses before Q&A is the cheapest credibility move a speaker has.

What I want you to leave with

One thing to take back. Your bottleneck is the model not knowing your stack, not the model itself. The agent has to learn what your services are, what depends on what, what your team's playbook says about Redis at 3am. None of that can come from training. It has to come from inside your environment. The talk is one architecture for getting it there, and the setup at our booth is exactly that architecture pointed at a customer account.

If you want to see the graph and the agent in action against your own infrastructure, the setup takes around thirty minutes, needs only read-only IAM, and installs no agents in production. We are at the Anyshift booth this week, or you can find me after the talk.