A weekday-cron terraform plan -refresh-only is the standard drift-detection setup, and for most cases it works. What it missed once was a `publicly_accessible` flag on a dev RDS instance that somebody had flipped to true from the AWS console during an unrelated emergency on a Friday afternoon, leaving the instance publicly resolvable on DNS for 60+ hours until Monday morning's run caught it. The attached security group was still private the whole time, so nothing actually opened up, but the security review the following week called it a near-miss anyway, rightly.

The architecture below would have caught that flip inside two minutes. It is buildable across AWS, GCP, and Azure, and every config block in this post is paste-able into your own account once you swap in your own account IDs, project IDs, and ARNs.

Why terraform plan -refresh-only is not the architecture you ship

terraform plan -refresh-only is the obvious starting point, and the HashiCorp tutorial walks through it cleanly for the simple case of one state file with one provider. At scale it hits three structural limits:

  • It only sees resources inside the current configuration, so anything created manually and never imported is invisible.
  • It only diffs attributes the provider's resource schema knows about, and changes to fields listed in `lifecycle.ignore_changes` slip past as well.
  • It takes minutes to tens of minutes against a production-sized state, which caps the practical cadence at once or twice per day.

The 60-hour publicly_accessible flip happened inside that gap.

The architecture below subscribes to each cloud's audit log and reacts in seconds to minutes. Per-cloud setup follows.

AWS: CloudTrail through EventBridge to a normalizer Lambda

CloudTrail emits an event for every management-plane API call. EventBridge can subscribe to that stream in near-real-time (typically seconds to a few minutes for management events) and route matching events to a target without going through the default S3 / Athena batch path, which would add 5 to 15 minutes of latency.

The rule's event pattern lists the API actions that modify the resource types we care about. The list below is a starting set; extend it for the resource types your environment actually uses:

{
  "source": ["aws.ec2", "aws.rds", "aws.iam", "aws.s3", "aws.kms"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": [
      "AuthorizeSecurityGroupIngress",
      "AuthorizeSecurityGroupEgress",
      "RevokeSecurityGroupIngress",
      "RevokeSecurityGroupEgress",
      "ModifyDBInstance",
      "PutBucketPolicy",
      "PutBucketAcl",
      "DeletePublicAccessBlock",
      "AttachRolePolicy",
      "PutRolePolicy",
      "DetachRolePolicy",
      "ScheduleKeyDeletion",
      "DisableKey"
    ]
  }
}
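Before deploying the rule, it can help to check sample events against the pattern locally. The sketch below is a deliberately simplified matcher written for this post, not EventBridge's own implementation: it handles only the exact-value form used above (lists mean "any of", nested objects recurse) and ignores operators such as prefix or anything-but. PATTERN is a trimmed copy of the pattern above, and the matches helper is a hypothetical name.

```python
from typing import Any

# Trimmed copy of the event pattern above (two sources, two event names).
PATTERN: dict[str, Any] = {
    "source": ["aws.ec2", "aws.rds"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventName": ["AuthorizeSecurityGroupIngress", "ModifyDBInstance"],
    },
}


def matches(pattern: dict[str, Any], event: dict[str, Any]) -> bool:
    """Exact-value matching only: a list means 'any of', a dict recurses."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: the event must have a nested object that matches.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True
```

Feeding it a handful of events copied out of CloudTrail is a cheap way to catch a misspelled eventName before the rule silently matches nothing.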

Create the rule with the AWS CLI:

aws events put-rule \
  --name iac-drift-events \
  --event-pattern file://event-pattern.json \
  --state ENABLED

aws events put-targets \
  --rule iac-drift-events \
  --targets "Id=1,Arn=arn:aws:lambda:us-east-1:111122223333:function:drift-normalizer,DeadLetterConfig={Arn=arn:aws:sqs:us-east-1:111122223333:iac-drift-dlq}"

# Without this, EventBridge silently fails to invoke the Lambda
aws lambda add-permission \
  --function-name drift-normalizer \
  --statement-id allow-eventbridge-iac-drift \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:111122223333:rule/iac-drift-events

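The rule fires per-region. Deploy it everywhere you operate (StackSet, or Terraform for_each over a region list), and remember that IAM is a global service whose CloudTrail events surface in us-east-1. Larger orgs centralize by forwarding matching events from each account's default event bus to a bus in one audit account, so the normalizer lives once; an organization trail on its own centralizes the S3 log archive, not the EventBridge stream.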

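The target needs a dead-letter queue (the SQS ARN in the --targets flag above); otherwise events that EventBridge cannot deliver are silently dropped once its retry policy runs out. That DLQ covers delivery failures only. EventBridge invokes Lambda asynchronously, so exceptions raised inside the function are retried by Lambda itself and need an on-failure destination or DLQ configured on the function. Skip both on day one and you'll learn about it the day your Lambda starts erroring on a malformed event.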

Cost is mostly Lambda. Per the EventBridge pricing page, AWS service events delivered to the default event bus are free; the cost line is the Lambda invocations downstream, which stay well under $10 per month at typical drift-event volumes for a 50-account org.

One trap on the S3 side: management-plane events (PutBucketPolicy and friends) flow through CloudTrail by default, but S3 data-plane events (PutObject, etc.) require explicitly enabling CloudTrail data events and can change the cost story materially. Don't enable them globally without a budget plan.

GCP: Cloud Audit Logs through a Pub/Sub log sink

GCP's equivalent is a log sink that routes filtered audit logs to a Pub/Sub topic. The topic feeds the same normalizer Lambda (or a Cloud Function, depending on where you run the consumer).

The filter syntax uses protoPayload.methodName to match on the audit log's recorded API method. The pattern below covers the GCP equivalents of the AWS event list above:

gcloud logging sinks create iac-drift-sink \
  pubsub.googleapis.com/projects/PROJECT_ID/topics/iac-drift-events \
  --log-filter='
    logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"
    AND (
      protoPayload.methodName=~"compute\.(firewalls|networkFirewallPolicies|regionNetworkFirewallPolicies)\.(insert|patch|delete|addRule|patchRule|removeRule)"
      OR protoPayload.methodName=~"cloudsql\.instances\.(create|update|patch|delete)"
      OR protoPayload.methodName=~"storage\.buckets\.(setIamPolicy|update)"
      OR protoPayload.methodName=~"iam\.serviceAccounts\.(setIamPolicy|create|delete)"
      OR protoPayload.methodName=~"cloudkms\.cryptoKeyVersions\.(destroy|disable)"
    )
  ' \
  --project=PROJECT_ID

Two GCP namespace traps cost me time. Cloud SQL audit-log methodName is cloudsql.instances.* (the audit-log namespace), not sqladmin.instances.* (the public API endpoint), so a filter against sqladmin.* matches nothing in the audit log. And the legacy compute.firewalls.* namespace only covers VPC firewall rules; the newer VPC firewall policies (the recommended modern pattern) live under compute.networkFirewallPolicies.* and compute.regionNetworkFirewallPolicies.*, with operations like addRule / patchRule / removeRule alongside the top-level CRUD. The regex above covers both. More generally, audit-log method names do not always mirror the REST method or IAM permission name, so confirm each pattern against a real entry in Logs Explorer before trusting it.

After creation, the sink prints a writer service account identity. Grant it pubsub.publisher on the target topic, otherwise the sink silently drops events:

gcloud pubsub topics add-iam-policy-binding iac-drift-events \
  --member="serviceAccount:WRITER_SA_EMAIL" \
  --role="roles/pubsub.publisher" \
  --project=PROJECT_ID
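On the consumer side, each Pub/Sub message carries one audit LogEntry as base64-encoded JSON. The sketch below assumes a push subscription, whose HTTP body wraps the message in a {"message": {"data": ...}} envelope, and pulls out the fields a normalizer needs. The function name and the output keys are hypothetical; the LogEntry field names (insertId, protoPayload.methodName, and so on) are the standard audit-log ones.

```python
import base64
import json
from typing import Any


def decode_pubsub_log_entry(envelope: dict[str, Any]) -> dict[str, Any]:
    """Decode a Pub/Sub push envelope and extract the audit-log fields."""
    raw = base64.b64decode(envelope["message"]["data"])
    entry = json.loads(raw)
    payload = entry.get("protoPayload", {})
    labels = entry.get("resource", {}).get("labels", {})
    return {
        # insertId is used as the event id here (an assumption of this
        # sketch); pair it with project and timestamp for stronger uniqueness.
        "source_event_id": entry["insertId"],
        "timestamp": entry["timestamp"],
        "project": labels.get("project_id"),
        "method": payload.get("methodName"),
        "resource_name": payload.get("resourceName"),
        "principal": payload.get("authenticationInfo", {}).get("principalEmail"),
        "source_ip": payload.get("requestMetadata", {}).get("callerIp"),
    }
```

A pull subscriber would skip the envelope and decode the message data directly; the field extraction is the same.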

Azure: Activity Log through Event Grid

Azure's equivalent is an Event Grid system topic on the subscription scope, filtered to the resource-write event types:

az eventgrid system-topic create \
  --name iac-drift-system-topic \
  --resource-group rg-monitoring \
  --source "/subscriptions/SUB_ID" \
  --topic-type Microsoft.Resources.Subscriptions \
  --location global

az eventgrid system-topic event-subscription create \
  --name iac-drift-sub \
  --resource-group rg-monitoring \
  --system-topic-name iac-drift-system-topic \
  --endpoint "https://drift-normalizer.azurewebsites.net/api/event" \
  --endpoint-type webhook \
  --included-event-types \
    "Microsoft.Resources.ResourceWriteSuccess" \
    "Microsoft.Resources.ResourceDeleteSuccess" \
    "Microsoft.Resources.ResourceActionSuccess"

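Webhook endpoints must implement the Event Grid subscription validation handshake; otherwise the subscription never enters the Provisioned state. The simpler route is to host the consumer as an Azure Function with the Event Grid trigger, which handles the handshake automatically; in that case create the subscription with --endpoint-type azurefunction and the function's resource ID as the endpoint instead of the webhook URL above.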
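If you do keep a plain webhook, the handshake itself is small. Event Grid POSTs a batch containing an event of type Microsoft.EventGrid.SubscriptionValidationEvent, and the endpoint has to answer 200 with the validation code echoed back. A minimal, framework-agnostic sketch follows; the validation_response function is a hypothetical name, and wiring it into your HTTP handler is left out.

```python
from typing import Any, Optional

VALIDATION_EVENT = "Microsoft.EventGrid.SubscriptionValidationEvent"


def validation_response(events: list[dict[str, Any]]) -> Optional[dict[str, str]]:
    """Return the handshake reply body, or None for an ordinary event batch."""
    for event in events:
        if event.get("eventType") == VALIDATION_EVENT:
            # Echo the code back; send this dict as the JSON body of a 200.
            return {"validationResponse": event["data"]["validationCode"]}
    return None
```

The handler calls this first on every request and only falls through to normal processing when it returns None.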

The Event Grid subscription delivers within seconds in the typical case. Azure's Activity Log retention is 90 days; for longer retention add an additional diagnostic setting that exports to a Log Analytics workspace or a storage account, which is independent of the Event Grid subscription and runs in parallel.
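Events delivered by the subscription above arrive in the Event Grid schema, with the interesting fields under data. The sketch below extracts the ones a normalizer needs and derives the ARM resource type from data.resourceUri. The function name, the output keys, and the mapping of event types to create/modify/delete are assumptions of this sketch; note that a ResourceWriteSuccess event does not distinguish create from update, so it is reported as modify here.

```python
from typing import Any

# Hypothetical mapping from Event Grid event type to a coarse operation.
OPERATIONS = {
    "Microsoft.Resources.ResourceWriteSuccess": "modify",
    "Microsoft.Resources.ResourceDeleteSuccess": "delete",
    "Microsoft.Resources.ResourceActionSuccess": "modify",
}


def extract_azure_fields(event: dict[str, Any]) -> dict[str, Any]:
    """Pull the normalizer inputs out of one Event Grid resource event."""
    data = event["data"]
    uri = data["resourceUri"]
    # Everything after the last "/providers/" is
    # <namespace>/<type>/<name>[/<subtype>/<subname>...]
    _, sep, tail = uri.rpartition("/providers/")
    resource_type = None
    if sep:
        parts = tail.split("/")
        # Namespace first, then every type segment (odd positions).
        resource_type = "/".join([parts[0], *parts[1::2]])
    return {
        "source_event_id": event["id"],
        "timestamp": event["eventTime"],
        "subscription": data.get("subscriptionId"),
        "resource_id": uri,
        "resource_type": resource_type,
        "operation": OPERATIONS.get(event["eventType"], "unknown"),
        "azure_operation": data.get("operationName"),
    }
```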

Normalizing all three streams into one event schema

Each provider emits a structurally different event envelope. The normalizer projects all three into a common shape so downstream code doesn't have to know which cloud the event came from.

A schema that's worked for me:

{
  "schema_version": 1,
  "id": "sha256-hex-of-source-event-id-and-cloud",
  "source_event_id": "<provider-native-event-id>",
  "timestamp": "2026-05-15T14:11:42Z",
  "cloud": "aws | gcp | azure",
  "account": "111122223333 | gcp-project-id | azure-subscription-id",
  "region": "us-east-1 | us-central1 | westeurope",
  "resource_type": "aws_security_group | google_compute_firewall | azurerm_network_security_group",
  "resource_id": "sg-0abc123de | projects/p/global/firewalls/f | /subscriptions/s/.../nsgs/n",
  "operation": "create | modify | delete",
  "modified_attributes": ["egress_rules"],
  "principal": {
    "type": "iam_role | service_account | federated_user | root",
    "id": "arn:aws:iam::111122223333:role/GitHubActions | sa@p.iam.gserviceaccount.com",
    "source_ip": "203.0.113.42"
  },
  "originating_iac": "terraform | cloudformation | console | api | managed-service | unknown"
}

Two non-obvious schema choices. schema_version exists because the schema will change, and downstream consumers should fail fast on a version they don't understand. The id field is a deterministic hash of (cloud, source_event_id), not a fresh uuid.uuid4() per call, because all three providers do at-least-once delivery; a deterministic id lets the consumer dedupe on insert.
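Both choices can be shown in a few lines. The sketch below uses an in-memory SQLite table as a stand-in for whatever store the consumer actually writes to; the table layout, the function names, and the set of supported versions are assumptions made for illustration. The point is that the deterministic id becomes the primary key, so a redelivered event is a no-op insert.

```python
import json
import sqlite3
from typing import Any

SUPPORTED_SCHEMA_VERSIONS = {1}


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events (id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def store_event(conn: sqlite3.Connection, event: dict[str, Any]) -> bool:
    """Insert a normalized event; return False if it was a duplicate delivery."""
    # Fail fast on a schema version this consumer does not understand.
    if event.get("schema_version") not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schema_version: {event.get('schema_version')!r}")
    cur = conn.execute(
        "INSERT OR IGNORE INTO events (id, body) VALUES (?, ?)",
        (event["id"], json.dumps(event)),
    )
    conn.commit()
    # rowcount is 1 for a fresh insert and 0 when the primary key already exists.
    return cur.rowcount == 1
```

With a random UUID per call, the second delivery of the same provider event would get a new key and produce a second alert.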

The Python normalizer for the AWS branch (the per-resource-type extractors are stubs you have to fill in for the resource types your environment uses; see notes below):

import hashlib
from typing import Any

SCHEMA_VERSION = 1

# Allowlist of full role ARNs your CI uses. Substring matching on role names
# is spoofable: a role named "evil-GitHubActions-shim" matches "GitHubActions"
# but is not your CI. Use full ARNs or a structured tag the CI sets on the
# assume-role session, never a name fragment.
CI_ROLE_ARNS = frozenset({
    "arn:aws:iam::111122223333:role/GitHubActions",
    "arn:aws:iam::111122223333:role/AtlantisRunner",
})

# AWS SSO sessions look like AWSReservedSSO_<permission-set>_<random> and are
# the strongest signal of a human in the console under IAM Identity Center.
SSO_PREFIX = "AWSReservedSSO_"

def _idempotency_id(cloud: str, source_event_id: str) -> str:
    return hashlib.sha256(f"{cloud}:{source_event_id}".encode()).hexdigest()

def normalize_aws_event(event: dict[str, Any]) -> dict[str, Any]:
    detail = event["detail"]
    user = detail.get("userIdentity", {})
    source_event_id = detail["eventID"]  # globally unique per CloudTrail event
    return {
        "schema_version": SCHEMA_VERSION,
        "id": _idempotency_id("aws", source_event_id),
        "source_event_id": source_event_id,
        "timestamp": detail["eventTime"],
        "cloud": "aws",
        "account": detail["recipientAccountId"],
        "region": detail["awsRegion"],
        "resource_type": _resource_type_from_event_name(detail["eventName"]),
        "resource_id": _extract_resource_id(detail),
        "operation": _operation_from_event_name(detail["eventName"]),
        "modified_attributes": _extract_modified_attributes(detail),
        "principal": {
            "type": user.get("type", "unknown").lower(),
            "id": user.get("arn") or user.get("userName") or "unknown",
            "source_ip": detail.get("sourceIPAddress"),
        },
        "originating_iac": _classify_origin(user),
    }

def _classify_origin(user: dict[str, Any]) -> str:
    if user.get("invokedBy") == "AWS Internal":
        return "managed-service"
    if user.get("type") != "AssumedRole":
        return "api" if user.get("type") == "IAMUser" else "unknown"
    session_issuer = user.get("sessionContext", {}).get("sessionIssuer", {})
    issuer_arn = session_issuer.get("arn", "")
    if issuer_arn in CI_ROLE_ARNS:
        return "terraform"
    role_name = issuer_arn.rsplit("/", 1)[-1] if issuer_arn else ""
    if role_name.startswith(SSO_PREFIX):
        return "console"
    return "unknown"

The four helper functions (_resource_type_from_event_name, _extract_resource_id, _operation_from_event_name, and _extract_modified_attributes) are intentionally left undefined. They are the tedious part: every CloudTrail event has its own envelope shape, and the resource ID lives in different fields per event (requestParameters.groupId for security groups, responseElements.dBInstance.dBInstanceArn for RDS modify). Implement them per resource type as you extend the EventBridge rule. There is no clever generic.

Joining the normalized event against Terraform state

A drift event is a normalized event whose resource_id exists in IaC state but whose originating_iac is something other than terraform. The join is mechanical, but the obvious one-liner against .values.root_module.resources[] only sees top-level resources and misses everything inside any module call. Real state files have nested modules, often several levels deep, so the extract has to recurse:

terraform show -json terraform.tfstate \
  | jq -c '
      def all_resources:
        (.resources[]?),
        (.child_modules[]? | all_resources);
      .values.root_module
      | all_resources
      | select(.values.id != null)
      | {address, type, id: .values.id, arn: .values.arn}
    '
{"address": "module.networking.aws_security_group.shared", "type": "aws_security_group", "id": "sg-0abc123de", "arn": "arn:aws:ec2:us-east-1:111122223333:security-group/sg-0abc123de"}
{"address": "module.databases.aws_db_instance.dev", "type": "aws_db_instance", "id": "dev-db", "arn": "arn:aws:rds:us-east-1:111122223333:db:dev-db"}

The state extract becomes a lookup table keyed on resource_id (index both id and arn, since some events carry one and some the other). For each incoming normalized event, the consumer looks up the resource. If it is in the table, the resource is Terraform-managed; if originating_iac is also something other than terraform, the event is drift. The lookup also yields the resource's Terraform address, which goes on the alert so the on-call knows which module to reconcile from.
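The join can be sketched in a few lines, assuming the extract is emitted as one JSON object per line (jq's -c flag) and the events follow the normalized schema from earlier. The function names and the three status labels are hypothetical. A real implementation would also need to handle ID collisions across resource types and multiple state files, which this sketch ignores.

```python
import json
from typing import Any, Iterable


def build_lookup(state_lines: Iterable[str]) -> dict[str, str]:
    """Map every known id and arn in the state extract to its Terraform address."""
    lookup: dict[str, str] = {}
    for line in state_lines:
        line = line.strip()
        if not line:
            continue
        resource = json.loads(line)
        # Index under both keys: some events carry the id, others the ARN.
        for key in (resource.get("id"), resource.get("arn")):
            if key:
                lookup[key] = resource["address"]
    return lookup


def classify(event: dict[str, Any], lookup: dict[str, str]) -> dict[str, Any]:
    """Label one normalized event as unmanaged, expected, or drift."""
    address = lookup.get(event["resource_id"])
    if address is None:
        status = "unmanaged"   # not in any state file we indexed
    elif event["originating_iac"] == "terraform":
        status = "expected"    # change came through the pipeline
    else:
        status = "drift"       # managed resource, changed out of band
    return {"status": status, "terraform_address": address}
```

For the original incident, the ModifyDBInstance event would resolve through the ARN to module.databases.aws_db_instance.dev and come back as drift with originating_iac set to console.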

Tradeoff matrix: three architectures for drift detection

If next-business-day detection is acceptable at your scale, the cron-based plan is fine. The event-driven architecture above is the right call when you need detection within minutes across multi-cloud and multi-state-file environments.

How Anyshift ships this as a managed capability

I work at Anyshift. The architecture above is a sketch of what we run. The non-trivial pieces are the per-resource-type schema mappers, the cross-state-file IaC inventory, and the severity ranker.

The honest annoyance is the severity ranker. It is conservative by default, so the on-call sees more medium-severity events in week one than is comfortable. We expose a dial. The fix in flight learns from the on-call's first-week dismissals to drop noise by week four. Customers tune by hand until that ships.

Against the original incident, the alert would have fired about two minutes after the console flip and would have included the IAM principal who made the change, the modified attribute, and the Terraform address of the affected instance. Whoever was on-call would have known exactly which module to reconcile from, and would not have lost the weekend wondering.

The price tag: two to three weeks of work for one engineer to do all three clouds. One cloud at a time, closer to a week.