A development RDS instance had its `publicly_accessible` flag flipped to true through the cloud console on a Friday afternoon for an unrelated emergency. The Terraform configuration for the instance still said false. The team ran drift detection once a weekday, so the instance stayed publicly resolvable through DNS until Monday morning's refresh caught it: 60+ hours of unintended public DNS exposure. Luckily the attached security group was still private and nothing actually opened to the internet, but the security review the following week treated it as a near-miss anyway, and rightly so.
The architecture that would have caught the flip in minutes rather than over a weekend can be built on AWS, GCP, and Azure alike. Every config block in this post is pasteable into your own account.
## Why `terraform plan -refresh-only` is not the architecture you ship
`terraform plan -refresh-only` is the obvious starting point, and the HashiCorp resource-drift tutorial walks through it cleanly. It works for the simple case: one state file, one provider, attributes the configuration controls. At scale it runs into three structural limits that are not always obvious until you hit them.
It only sees resources inside the current configuration, so a resource created manually in the same account but never imported is invisible. It only diffs attributes the provider's resource schema knows about, so out-of-band changes to schema-ignored fields (or fields you've put in `lifecycle.ignore_changes` for legitimate reasons) slip past on the next refresh. And it takes minutes to tens of minutes against a production-sized state, which caps the practical detection cadence at once or twice per day in most pipelines. The 60-hour `publicly_accessible` flip happened entirely inside that gap.
The architecture below subscribes to each cloud's audit log and reacts in seconds to minutes. Per-cloud setup follows.
## AWS: CloudTrail through EventBridge to a normalizer Lambda
CloudTrail emits an event for every management-plane API call. EventBridge can subscribe to that stream in near-real-time (typically seconds to a few minutes for management events) and route matching events to a target without going through the default S3 / Athena batch path, which would add 5 to 15 minutes of latency.
The rule's event pattern lists the API actions that modify the resource types we care about. The list below is a starting set; extend it for the resource types your environment actually uses:
```json
{
  "source": ["aws.ec2", "aws.rds", "aws.iam", "aws.s3", "aws.kms"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventName": [
      "AuthorizeSecurityGroupIngress",
      "AuthorizeSecurityGroupEgress",
      "RevokeSecurityGroupIngress",
      "RevokeSecurityGroupEgress",
      "ModifyDBInstance",
      "PutBucketPolicy",
      "PutBucketAcl",
      "DeletePublicAccessBlock",
      "AttachRolePolicy",
      "PutRolePolicy",
      "DetachRolePolicy",
      "ScheduleKeyDeletion",
      "DisableKey"
    ]
  }
}
```

Create the rule with the AWS CLI:
```shell
aws events put-rule \
  --name iac-drift-events \
  --event-pattern file://event-pattern.json \
  --state ENABLED

aws events put-targets \
  --rule iac-drift-events \
  --targets "Id=1,Arn=arn:aws:lambda:us-east-1:111122223333:function:drift-normalizer,DeadLetterConfig={Arn=arn:aws:sqs:us-east-1:111122223333:iac-drift-dlq}"

# Without this, EventBridge silently fails to invoke the Lambda
aws lambda add-permission \
  --function-name drift-normalizer \
  --statement-id allow-eventbridge-iac-drift \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:111122223333:rule/iac-drift-events
```

A few practical notes on shipping this in a real account. The rule fires per region, so deploy it in every region you operate in (a CloudFormation StackSet or a Terraform `for_each` over a region list). Most large orgs simplify the per-region, per-account spread by routing all CloudTrail events into a single audit account via an organization trail, so the EventBridge rule lives once on the audit account's custom bus rather than fifty times across member accounts.
The target needs a dead-letter queue (the SQS ARN in the --targets flag above), otherwise failed Lambda invocations are silently dropped after EventBridge's retry policy runs out. Skip the DLQ on day one and you'll learn about it the day your Lambda starts erroring on a malformed event.
Cost is mostly Lambda. Per the EventBridge pricing page, AWS service events delivered to the default event bus are free; the cost line is the Lambda invocations downstream, which stay well under $10 per month at typical drift-event volumes for a 50-account org.
One trap on the S3 side: management-plane events (PutBucketPolicy and friends) flow through CloudTrail by default, but S3 data-plane events (PutObject, etc.) require explicitly enabling CloudTrail data events and can change the cost story materially. Don't enable them globally without a budget plan.
## GCP: Cloud Audit Logs through a Pub/Sub log sink
GCP's equivalent is a log sink that routes filtered audit logs to a Pub/Sub topic. The topic feeds the same normalizer (the Lambda from the AWS section, or a Cloud Function, depending on where you run the consumer).
The filter syntax uses `protoPayload.methodName` to match on the audit log's recorded API method. The pattern below covers the GCP equivalents of the AWS event list above:
```shell
gcloud logging sinks create iac-drift-sink \
  pubsub.googleapis.com/projects/PROJECT_ID/topics/iac-drift-events \
  --log-filter='
    logName="projects/PROJECT_ID/logs/cloudaudit.googleapis.com%2Factivity"
    AND (
      protoPayload.methodName=~"compute\.(firewalls|networkFirewallPolicies|regionNetworkFirewallPolicies)\.(insert|patch|delete|addRule|patchRule|removeRule)"
      OR protoPayload.methodName=~"cloudsql\.instances\.(create|update|patch|delete)"
      OR protoPayload.methodName=~"storage\.buckets\.(setIamPolicy|update)"
      OR protoPayload.methodName=~"iam\.serviceAccounts\.(setIamPolicy|create|delete)"
      OR protoPayload.methodName=~"cloudkms\.cryptoKeyVersions\.(destroy|disable)"
    )
  ' \
  --project=PROJECT_ID
```

A couple of GCP namespace traps that cost me time. The Cloud SQL audit-log `methodName` lives under `cloudsql.instances.*` (the audit-log namespace), not `sqladmin.instances.*` (the public API endpoint), so a filter against `sqladmin.*` matches nothing in the audit log. And the legacy `compute.firewalls.*` namespace only covers VPC firewall rules; the newer VPC firewall policies (the recommended modern pattern) live under `compute.networkFirewallPolicies.*` and `compute.regionNetworkFirewallPolicies.*`, with operations like `addRule` / `patchRule` / `removeRule` alongside the top-level CRUD. The regex above covers both.
After creation, the sink prints a writer service account identity. Grant it `roles/pubsub.publisher` on the target topic, otherwise the sink silently drops events:
```shell
gcloud pubsub topics add-iam-policy-binding iac-drift-events \
  --member="serviceAccount:WRITER_SA_EMAIL" \
  --role="roles/pubsub.publisher" \
  --project=PROJECT_ID
```

Cloud Audit Logs admin-activity ingestion is free into the `_Required` and `_Default` log buckets, but the sink-to-Pub/Sub path adds Pub/Sub publish, throughput, and consumer-side compute costs. For a 50-project org with a tightly scoped filter like the one above, expect low single-digit dollars per month per the Pub/Sub pricing page; broaden the filter and the bill scales with event volume.
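On the consumer side, the log entry arrives base64-encoded inside the Pub/Sub message envelope, and forgetting to unwrap it is a classic first-run bug. A minimal sketch of decoding a push-delivery body back into the `LogEntry` JSON (the function name is mine; wire it to whatever HTTP handler or pull loop you actually run):

```python
import base64
import json
from typing import Any


def unwrap_pubsub_push(body: dict[str, Any]) -> dict[str, Any]:
    """Decode a LogEntry out of a Pub/Sub push-delivery envelope.

    Push delivery base64-encodes the published bytes into message.data;
    the log sink publishes each matching LogEntry as one JSON document.
    """
    raw = base64.b64decode(body["message"]["data"])
    return json.loads(raw)
```

Pull-based consumers see the same base64 `data` field on each received message, so the decode step is identical either way.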
## Azure: Activity Log through Event Grid
Azure's equivalent is an Event Grid system topic on the subscription scope, filtered to the resource-write event types:
```shell
az eventgrid system-topic create \
  --name iac-drift-system-topic \
  --resource-group rg-monitoring \
  --source "/subscriptions/SUB_ID" \
  --topic-type Microsoft.Resources.Subscriptions \
  --location global

az eventgrid system-topic event-subscription create \
  --name iac-drift-sub \
  --resource-group rg-monitoring \
  --system-topic-name iac-drift-system-topic \
  --endpoint "https://drift-normalizer.azurewebsites.net/api/event" \
  --endpoint-type webhook \
  --included-event-types \
    "Microsoft.Resources.ResourceWriteSuccess" \
    "Microsoft.Resources.ResourceDeleteSuccess" \
    "Microsoft.Resources.ResourceActionSuccess"
```

One operational gotcha that sinks half of first-time deployments: webhook endpoints must implement the Event Grid subscription validation handshake. On `event-subscription create`, Event Grid sends a `Microsoft.EventGrid.SubscriptionValidationEvent` and expects either an inline response containing the validation code or a manual GET to the validation URL. If the endpoint isn't built to handle it, the create command fails at validation and the subscription never reaches the Provisioned state. The cleanest path is to host the consumer as an Azure Function with the Event Grid trigger, which handles the handshake automatically; bare webhooks on App Service, or Functions HTTP triggers, must implement it explicitly.
The Event Grid subscription delivers within seconds in the typical case. Azure keeps the Activity Log for 90 days; for longer retention, add a diagnostic setting that exports to a Log Analytics workspace or a storage account, which is independent of the Event Grid subscription and runs in parallel.
## Normalizing all three streams into one event schema
Each provider emits a structurally different event envelope. The normalizer's job is to project all three into a common shape so downstream code (the IaC-state join, the severity ranker, the on-call notifier) doesn't have to know which cloud the event came from.
A schema that's worked for me:
```json
{
  "schema_version": 1,
  "id": "sha256-hex-of-source-event-id-and-cloud",
  "source_event_id": "<provider-native-event-id>",
  "timestamp": "2026-05-15T14:11:42Z",
  "cloud": "aws | gcp | azure",
  "account": "111122223333 | gcp-project-id | azure-subscription-id",
  "region": "us-east-1 | us-central1 | westeurope",
  "resource_type": "aws_security_group | google_compute_firewall | azurerm_network_security_group",
  "resource_id": "sg-0abc123de | projects/p/global/firewalls/f | /subscriptions/s/.../nsgs/n",
  "operation": "create | modify | delete",
  "modified_attributes": ["egress_rules"],
  "principal": {
    "type": "iam_role | service_account | federated_user | root",
    "id": "arn:aws:iam::111122223333:role/GitHubActions | sa@p.iam.gserviceaccount.com",
    "source_ip": "203.0.113.42"
  },
  "originating_iac": "terraform | cloudformation | console | api | unknown"
}
```

A couple of the schema choices are non-obvious enough to call out. `schema_version` is there because the schema will change as you cover more resource types, and downstream consumers need a way to fail fast on a version they don't understand rather than silently mis-parse it. The `id` field is a deterministic hash of `(cloud, source_event_id)` rather than a fresh `uuid.uuid4()` per call, because EventBridge, Pub/Sub, and Event Grid are all at-least-once delivery, and the same source event will sometimes arrive at the normalizer twice. A deterministic id lets the downstream consumer dedupe on insert and avoid emitting two drift alerts for one underlying change.
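That dedupe-on-insert step can be sketched in a few lines. The in-memory set below is a stand-in for what should be a conditional put against a key-value store in a real deployment, since Lambda instances don't share memory:

```python
class DriftEventDeduper:
    """Drop events whose deterministic id has already been processed.

    The in-memory set stands in for a conditional insert (put-if-absent
    against DynamoDB, Firestore, or similar) in a real deployment.
    """

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def accept(self, event: dict) -> bool:
        """Return True exactly once per event id; duplicates return False."""
        event_id = event["id"]
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True
```

The consumer calls `accept` before doing any alerting work; a False return means the other delivery of the same source event already won the race.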
The Python normalizer for the AWS branch (the per-resource-type extractors are stubs you have to fill in for the resource types your environment uses; see notes below):
```python
import hashlib
from typing import Any

SCHEMA_VERSION = 1

# Allowlist of full role ARNs your CI uses. Substring matching on role names
# is spoofable: a role named "evil-GitHubActions-shim" matches "GitHubActions"
# but is not your CI. Use full ARNs or a structured tag the CI sets on the
# assume-role session, never a name fragment.
CI_ROLE_ARNS = frozenset({
    "arn:aws:iam::111122223333:role/GitHubActions",
    "arn:aws:iam::111122223333:role/AtlantisRunner",
})

# AWS SSO sessions look like AWSReservedSSO_<permission-set>_<random> and are
# the strongest signal of a human in the console under IAM Identity Center.
SSO_PREFIX = "AWSReservedSSO_"


# Per-resource-type extractors: intentionally stubs, see the notes below.
def _resource_type_from_event_name(event_name: str) -> str: ...
def _extract_resource_id(detail: dict[str, Any]) -> str: ...
def _operation_from_event_name(event_name: str) -> str: ...
def _extract_modified_attributes(detail: dict[str, Any]) -> list[str]: ...


def _idempotency_id(cloud: str, source_event_id: str) -> str:
    return hashlib.sha256(f"{cloud}:{source_event_id}".encode()).hexdigest()


def normalize_aws_event(event: dict[str, Any]) -> dict[str, Any]:
    detail = event["detail"]
    user = detail.get("userIdentity", {})
    source_event_id = detail["eventID"]  # globally unique per CloudTrail event
    return {
        "schema_version": SCHEMA_VERSION,
        "id": _idempotency_id("aws", source_event_id),
        "source_event_id": source_event_id,
        "timestamp": detail["eventTime"],
        "cloud": "aws",
        "account": detail["recipientAccountId"],
        "region": detail["awsRegion"],
        "resource_type": _resource_type_from_event_name(detail["eventName"]),
        "resource_id": _extract_resource_id(detail),
        "operation": _operation_from_event_name(detail["eventName"]),
        "modified_attributes": _extract_modified_attributes(detail),
        "principal": {
            "type": user.get("type", "unknown").lower(),
            "id": user.get("arn") or user.get("userName") or "unknown",
            "source_ip": detail.get("sourceIPAddress"),
        },
        "originating_iac": _classify_origin(user),
    }


def _classify_origin(user: dict[str, Any]) -> str:
    if user.get("invokedBy") == "AWS Internal":
        return "managed-service"
    if user.get("type") != "AssumedRole":
        return "api" if user.get("type") == "IAMUser" else "unknown"
    session_issuer = user.get("sessionContext", {}).get("sessionIssuer", {})
    issuer_arn = session_issuer.get("arn", "")
    if issuer_arn in CI_ROLE_ARNS:
        return "terraform"
    role_name = issuer_arn.rsplit("/", 1)[-1] if issuer_arn else ""
    if role_name.startswith(SSO_PREFIX):
        return "console"
    return "unknown"
```

The four `_resource_type_from_event_name` / `_extract_resource_id` / `_operation_from_event_name` / `_extract_modified_attributes` helpers are intentionally stubs in the snippet above. They are the tedious part of the work: every CloudTrail event for a given resource type has its own envelope shape, and the resource ID lives in a different field per event (`requestParameters.groupId` for security groups, `responseElements.dBInstance.dBInstanceArn` for the RDS modify, `requestParameters.bucketName` for S3, etc.). Implement them per resource type as you extend the EventBridge rule's `eventName` list. There is no clever generic version.
The GCP and Azure normalizers follow the same overall shape against their own audit-event envelopes. The classifier is the most opinionated piece, and the highest-leverage one: correctly tagging terraform vs console vs managed-service events is what makes the downstream noise tractable. Audit-log fields the classifier should never trust as ground truth on their own: any user-controlled string (role name fragments, session names without a known prefix, tags), and any field that's optional in the event schema.
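For illustration, a sketch of what the GCP branch can look like against the Cloud Audit Logs `LogEntry` fields. The service-account allowlist is an assumption to replace with your own CI identities, and the operation / attribute extraction is stubbed exactly as on the AWS side:

```python
import hashlib
from typing import Any

SCHEMA_VERSION = 1

# Service accounts your Terraform pipelines run as. Placeholder values:
# match on the full email, never a name fragment, same as the AWS ARNs.
CI_SERVICE_ACCOUNTS = frozenset({
    "terraform-ci@my-project.iam.gserviceaccount.com",
})


def normalize_gcp_event(entry: dict[str, Any]) -> dict[str, Any]:
    payload = entry["protoPayload"]
    principal = payload.get("authenticationInfo", {}).get("principalEmail", "unknown")
    labels = entry.get("resource", {}).get("labels", {})
    source_event_id = entry["insertId"]  # with timestamp, identifies the entry
    return {
        "schema_version": SCHEMA_VERSION,
        "id": hashlib.sha256(f"gcp:{source_event_id}".encode()).hexdigest(),
        "source_event_id": source_event_id,
        "timestamp": entry.get("timestamp"),
        "cloud": "gcp",
        "account": labels.get("project_id"),
        "region": labels.get("location", "global"),
        "resource_type": payload.get("serviceName"),  # refine per methodName
        "resource_id": payload.get("resourceName"),
        "operation": "modify",       # stub: derive from the methodName suffix
        "modified_attributes": [],   # stub, as on the AWS side
        "principal": {
            "type": ("service_account"
                     if principal.endswith(".gserviceaccount.com") else "user"),
            "id": principal,
            "source_ip": payload.get("requestMetadata", {}).get("callerIp"),
        },
        "originating_iac": ("terraform" if principal in CI_SERVICE_ACCOUNTS
                            else "unknown"),
    }
```

The `.gserviceaccount.com` suffix check is a coarse type split, not an authorization decision; the allowlist on the full email is what actually drives the `originating_iac` tag.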
## Joining the normalized event against Terraform state
A drift event is a normalized event whose `resource_id` exists in IaC state but whose `originating_iac` is something other than `terraform`. The join is mechanical, but the obvious one-liner against `.values.root_module.resources[]` only sees top-level resources and misses everything inside any module call. Real state files have nested modules, often several levels deep, so the extract has to recurse:
```shell
terraform show -json terraform.tfstate \
  | jq -c '
      def all_resources:
        (.resources[]?),
        (.child_modules[]? | all_resources);
      .values.root_module
      | all_resources
      | select(.values.id != null)
      | {address, type, id: .values.id, arn: .values.arn}
    '
```

```json
{"address": "module.networking.aws_security_group.shared", "type": "aws_security_group", "id": "sg-0abc123de", "arn": "arn:aws:ec2:us-east-1:111122223333:security-group/sg-0abc123de"}
{"address": "module.databases.aws_db_instance.dev", "type": "aws_db_instance", "id": "dev-db", "arn": "arn:aws:rds:us-east-1:111122223333:db:dev-db"}
```

The state extract becomes a lookup table keyed on `resource_id`. For each incoming normalized event, the consumer looks the resource up: if it's in the table, the resource is Terraform-managed; if `originating_iac` isn't `terraform`, the event is drift; and the resource's Terraform address is now known and can be attached to the alert, so the on-call knows which module to reconcile from.
For multi-state-file environments the same join repeats per state file, with the lookup tables merged. The state extracts can be regenerated on a schedule (every 15 minutes is a reasonable starting cadence) and cached. A subtle cost trap: `terraform show -json` on large states can pull tens of MB and take several seconds; if you run it from inside the EventBridge consumer per event you'll burn money and rate-limit yourself out of the Terraform Cloud / Atlantis state API. Pre-compute the lookup table out of band and read it from a key-value store the consumer queries in microseconds.
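The join itself, once the lookup table exists, is a few lines. A sketch, assuming the jq extract from above has been loaded as a list of dicts, and indexing on both `id` and `arn` since cloud events surface one or the other depending on the API:

```python
from typing import Any


def build_lookup(state_resources: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Index the state extract on every identifier an event might carry."""
    table: dict[str, dict[str, Any]] = {}
    for res in state_resources:
        for key in (res.get("id"), res.get("arn")):
            if key:
                table[key] = res
    return table


def classify_event(event: dict[str, Any],
                   table: dict[str, dict[str, Any]]) -> dict[str, Any]:
    managed = table.get(event["resource_id"])
    if managed is None:
        # Never imported into any state file: a different problem class.
        return {"verdict": "unmanaged", "address": None}
    if event["originating_iac"] == "terraform":
        return {"verdict": "expected", "address": managed["address"]}
    return {"verdict": "drift", "address": managed["address"]}
```

The `address` field carried on the verdict is what lets the alert name the exact module to reconcile from, rather than just the raw cloud resource ID.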
## Tradeoff matrix: three architectures for drift detection
| Approach | Latency | Setup | Cost (50-acct AWS org/month) | Coverage |
|---|---|---|---|---|
| `terraform plan -refresh-only` on cron | hours-to-days | low | $0 beyond CI minutes | configuration-scope only |
| AWS Config + Aggregator + Conformance Packs | minutes | medium | ~$0.003/config item recorded × resource churn (often $50-200) | AWS-only, native multi-account |
| CloudTrail S3 batch + custom diff | 5-15 min | medium | ~$3/account S3 + Lambda | broad, per-region |
| EventBridge → normalizer → state join | seconds-to-minutes | high | $0 EventBridge default bus + Lambda invocations (~$10) | broad, real-time, multi-cloud |
Pick row 1 if you have one Terraform configuration, one cloud, and weekly drift is acceptable. AWS Config (row 2) is the AWS-native answer and the right call if you live entirely on AWS, want a managed service rather than custom code, and don't mind that the Conformance Pack drift concept is built around CloudFormation rather than Terraform; it does not natively join against `terraform show -json` output, so you still write the IaC-state correlation yourself. Pick row 4 (the EventBridge architecture above) if you have multiple state files, multiple clouds, and the drift you care about is a same-day incident class. Row 3 is a transitional architecture: it pre-dates EventBridge and CloudWatch Logs subscription filters being broadly available, and the only reason to choose it now is if you're inheriting an existing S3-batch pipeline that already works.
## How Anyshift ships this as a managed capability
I work at Anyshift. The architecture above is a sketch of what we run, with a few pieces that are non-trivial to build yourself: the per-resource-type schema mappers (every cloud has hundreds of resource types and the mapping is mechanical but tedious), the cross-state-file IaC inventory that resolves a `resource_id` back to an address even when the resource is managed in a sibling state file the consumer doesn't directly know about, and the severity ranker that scores each drift event against resource sensitivity, change reversibility, and originating identity.
The honest annoyance from our side is the severity ranker. It's conservative by default, so the on-call sees more medium-severity events than is comfortable in week one. We expose a dial as the stop-gap. What's in flight is a per-tenant default that learns from the on-call's first-week dismissals so the noise drops on its own by week four. Until that ships, customers tune the ranker by hand, and we say so up front.
The 60-hour `publicly_accessible` flip would have surfaced within two minutes of the console patch landing, with the originating IAM principal, the modified attribute, and the Terraform address of the affected instance attached to the alert. The architecture that gets you there is realistically two to three weeks for a single engineer to ship across all three clouds with the operational details (DLQ, multi-account aggregation, schema versioning, idempotency, the Azure validation handshake, the per-resource extractors) included; one cloud at a time, it is closer to a week. Either way it is more engineering than `terraform plan` looks like on a roadmap, and it is the difference between drift you find on Monday and drift you find on Friday.
