The page came in at 14:32 UTC on a Friday morning. Error rate on checkout-api had jumped 12x in two minutes, and outbound calls to a third-party billing API were timing out. My team owns checkout-api, but a quick git log on its repo showed nothing useful, because nobody on my team had merged anything in three days.

The actual change was on the networking team's side. A platform engineer had merged an egress-tightening PR earlier the same morning, replacing an open egress rule on a shared security group with a curated allow-list. The billing vendor's IP range wasn't on the new list. Nobody recognized the CIDRs in the diff well enough to catch it at review. The walkthrough below is the four-step sequence I'd run to find that PR, starting from the alert.

Step 1: bound the change window from the alert timestamp

The alert fired at 14:32 UTC on a metric with a two-minute evaluation window, so the actual breakage started at roughly 14:30 UTC. I default to a one-hour look-back unless I have reason to widen it, so the change window is 13:30 UTC to 14:32 UTC.

The first command sets a couple of shell variables I'll reuse in every subsequent step:

START="2026-05-15T13:30:00Z"
END="2026-05-15T14:32:00Z"
SERVICE="checkout-api"

Two minutes well spent, because the rest of the investigation lives or dies on a tight window. Half the slow incident traces I've seen burned the budget here, opening tools with default 24-hour windows and reading through hundreds of irrelevant events.

Step 2: rule out the application deploy

Application deploys are the single most common change category, so they're the first hypothesis to eliminate.

kubectl -n production rollout history deployment/checkout-api
deployment.apps/checkout-api
REVISION  CHANGE-CAUSE
3         Deploy 4ce81a7 by github-actions
4         Deploy 9b2f10c by github-actions
5         Deploy 7d3a2e8 by github-actions

CHANGE-CAUSE only populates if your CI step explicitly annotates the deployment (kubectl annotate deployment/checkout-api kubernetes.io/change-cause="..."). The legacy --record flag has been deprecated for a long time and still ships with a deprecation warning in current kubectl releases, but most modern pipelines have moved to explicit annotation or to reading the rolled-out image SHA from the deployment template directly.
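
The second route is one jsonpath away; this sketch assumes the api container is the first (or only) entry in the pod template:

kubectl -n production get deployment/checkout-api \
  -o jsonpath='{.spec.template.spec.containers[0].image}'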

To get the actual rollout timestamp, look at the deployment's events:

kubectl -n production describe deployment/checkout-api
...
Pod Template:
  Containers:
    api:
      Image:  registry.example.com/checkout-api:7d3a2e8
...
Events:
  Type    Reason             Age    From                   Message
  ----    ------             ----   ----                   -------
  Normal  ScalingReplicaSet  2h17m  deployment-controller  Scaled up replica set checkout-api-6b8d4f7c5d to 3
  Normal  ScalingReplicaSet  2h15m  deployment-controller  Scaled down replica set checkout-api-7c4a2b9d68 to 0

The last rollout completed 2 hours 15 minutes ago, well outside the 13:30-to-14:32 window. The application deploy is not the cause. I move on instead of opening the deploy diff.
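
One caveat: deployment events age out (the apiserver's default event TTL is an hour), so on clusters where the Events section comes back empty, the creation timestamp of the newest ReplicaSet tells you when the last rollout began. A sketch, assuming the deployment carries the usual app label:

kubectl -n production get replicasets -l app=checkout-api \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns='NAME:.metadata.name,CREATED:.metadata.creationTimestamp'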

Step 3: query CloudTrail for infrastructure changes

Now I look at the cloud audit log for any modifications to the resources checkout-api depends on. The service talks to the billing API through the cluster's egress, and the egress is governed by a shared security group, so I query CloudTrail for security-group events in the window:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=AuthorizeSecurityGroupEgress \
  --start-time "$START" --end-time "$END" \
  --max-results 5 \
  --query 'Events[].[EventTime,EventName,Username,Resources[0].ResourceName]' \
  --output table
-----------------------------------------------------------------------------------------------
|                                       LookupEvents                                          |
+----------------------+--------------------------------+------------------------+------------+
|  2026-05-15T14:11:43Z|  AuthorizeSecurityGroupEgress  |  GitHubActions/r-78124 |sg-0abc123de|
|  2026-05-15T14:11:43Z|  AuthorizeSecurityGroupEgress  |  GitHubActions/r-78124 |sg-0abc123de|
+----------------------+--------------------------------+------------------------+------------+

I run the same query for RevokeSecurityGroupEgress and see one event from the same GitHubActions runner at 14:11:42. Read together, the three events are an apply that removed an existing egress rule and added two narrower ones. The runner was triggered by a PR merge, the security group is sg-0abc123de, and the change happened 21 minutes before the alert.

A few things I learned the hard way running this command on real incidents. The lookup-events API only retains the last 90 days of management events and rate-limits at 2 requests per second per region, so older traces or broader scans should go through CloudWatch Logs Insights or Athena instead. The structured CloudTrail event JSON carries requestParameters with the actual CIDR blocks added or removed, which the table view above hides. And the top-level Resources[] array is populated for some event types but not all; when the resource ID isn't in Resources[0].ResourceName, parse the full CloudTrailEvent JSON blob and pull from requestParameters directly (requestParameters.groupId for security groups, requestParameters.bucketName for S3, and so on).
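
A sketch of that extraction for the revoke event, assuming jq is on the box and at least one event matched; requestParameters carries the groupId and the ipPermissions block with the exact CIDRs:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=RevokeSecurityGroupEgress \
  --start-time "$START" --end-time "$END" --max-results 1 \
  --query 'Events[0].CloudTrailEvent' --output text |
  jq '{groupId: .requestParameters.groupId, cidrs: .requestParameters.ipPermissions}'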

When I don't know which resource type to suspect, I start with a broader filter on --lookup-attributes AttributeKey=ResourceType,AttributeValue=AWS::EC2::SecurityGroup (or whichever resource the affected service most directly depends on) before narrowing to specific event names.
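
That broader scan looks like this, reusing the same window variables; I keep the columns minimal because at this stage I only want to know whether anything touched the resource type at all:

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=ResourceType,AttributeValue=AWS::EC2::SecurityGroup \
  --start-time "$START" --end-time "$END" \
  --query 'Events[].[EventTime,EventName,Username]' \
  --output table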

Step 4: trace the CloudTrail event back to the IaC commit

I now have the security group ID, the timestamp, and the runner ID. The next step is finding the PR that triggered the runner.

The Terraform module that owns this security group lives in our infra monorepo. I jump there and walk the git log filtered to the same change window:

cd ~/repos/infra
git log --since="$START" --until="$END" --first-parent main \
  -- modules/networking/security_groups.tf
commit 9c8b1a4f...
Author: Pierre Martin <pierre@example.com>
Date:   Fri May 15 16:08:21 2026 +0200

    networking: tighten shared-egress to allowlist (#421)

Then the diff:

git show 9c8b1a4f -- modules/networking/security_groups.tf

I'm now looking at the actual change. The new allow-list CIDRs are visible in the diff. I cross-reference against the billing vendor's published IP range docs. The vendor's 203.0.113.0/24 is missing from the new allow-list. Root cause found, six minutes from alert.

The temporary fix is to add the vendor's 203.0.113.0/24 to the SG via the AWS console as a single new egress rule, then push a hotfix PR that adds the same CIDR to the allow-list in code so the next terraform apply doesn't revert it. Restoring the original open egress would re-create the security exposure the original PR was closing, so it's the wrong rollback even as a stopgap.
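
If the terminal is closer than the console, the same stopgap is one CLI call; the port here is an assumption (our billing calls are plain HTTPS), so adjust it to whatever the integration actually uses:

aws ec2 authorize-security-group-egress \
  --group-id sg-0abc123de \
  --ip-permissions '[{"IpProtocol":"tcp","FromPort":443,"ToPort":443,"IpRanges":[{"CidrIp":"203.0.113.0/24","Description":"billing vendor allow-list hotfix"}]}]'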

Why deployment markers would have collapsed Steps 2 to 4

The four-step sequence above takes around six minutes once you've done it twice. The first time, with no muscle memory, it took me 25 minutes including two wrong CloudTrail filters. Deployment markers eliminate most of that.

A deployment marker is a timestamped record posted from CI/CD when a commit reaches an environment. Posted to the Datadog Events API it looks like:

curl -X POST "https://api.datadoghq.com/api/v1/events" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Deploy: networking PR #421",
    "text": "Commit 9c8b1a4 applied to production via Atlantis",
    "tags": ["env:production", "service:networking", "type:terraform-apply"],
    "alert_type": "info",
    "date_happened": 1778854301
  }'

With markers in place for both application deploys (already common) and terraform apply runs (still rare), the same investigation collapses to one query against the marker timeline, filtered to the change window and the resource neighborhood of checkout-api. The query returns two rows: the unrelated application deploy at 12:15 and the networking apply at 14:11, runner ID attached. I jump straight to step 4 and skip CloudTrail entirely.
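
For reference, the raw version of that marker lookup against the Datadog events API looks roughly like the following (minus the neighborhood scoping, which needs more than tags); reads need an application key on top of the API key, and the epoch conversion assumes GNU date:

curl -s -G "https://api.datadoghq.com/api/v1/events" \
  -H "DD-API-KEY: $DD_API_KEY" \
  -H "DD-APPLICATION-KEY: $DD_APP_KEY" \
  --data-urlencode "start=$(date -u -d "$START" +%s)" \
  --data-urlencode "end=$(date -u -d "$END" +%s)" \
  --data-urlencode "tags=env:production,type:terraform-apply"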

Honeycomb Markers and Sentry Releases expose equivalent primitives. The pattern is well-established for application deploys and structurally absent for infrastructure changes in most teams I've worked with. Adding terraform apply markers is half a day of pipeline work and pays back on the first incident.
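
For comparison, the Honeycomb version of the terraform-apply marker is a single POST to the Markers API; the dataset slug here is a placeholder for wherever your checkout traces live:

curl -X POST "https://api.honeycomb.io/1/markers/checkout-api" \
  -H "X-Honeycomb-Team: $HONEYCOMB_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"message": "terraform apply: networking PR #421", "type": "terraform-apply", "start_time": 1778854301}'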

How Annie collapses this into one query

I work at Anyshift, and watching Annie collapse the four-step sequence into a single query in a couple of seconds is what made me reverse-engineer the sequence in the first place. I wanted to know what she was actually doing under the hood when she answered "what changed near checkout-api in the last 24 hours?".

The platform builds a continuously updated graph of the production infrastructure across cloud providers, Kubernetes, IaC state, and the connected CI/CD and observability tooling. The graph knows that checkout-api runs in EKS namespace production, that its pods route egress through sg-0abc123de, that the security group is managed in the infra monorepo under modules/networking, and that the most recent apply against that path landed at 14:11. When the alert fires, Annie joins the marker stream with the dependency graph, scopes to the affected resource's neighborhood, and returns one ranked list ordered by temporal proximity to the alert and dependency distance to checkout-api.

The honest annoyance: getting the dependency graph rich enough to make the ranking trustworthy is the work. The first month after a new customer connects, the graph knows the IaC topology and the Kubernetes routing but not the soft runtime dependencies (the third-party API a service calls without an explicit binding, the implicit routing through a service mesh that isn't in any module). That gap closes over a few weeks as Annie observes traces and incident patterns, but it is not zero on day one and we say so up front.

The next thing in flight is auto-routing the alert to the team that owns the responsible change rather than the team that owns the symptom. The four-step sequence above is the right answer for the on-call engineer. It is the wrong answer for the platform engineer in the next room over who could have rolled back the change before the alert fired if anything had told her it was hers.