Feb 3, 2025

Identity and Access Management (IAM): A Deep Dive in AWS Resources & Best Practices to Adopt

Identity and Access Management (IAM) is central to any cloud infrastructure.

At a high level, IAM is about assigning permissions to entities to allow or deny these entities to perform actions in your cloud environment. IAM is the foundation that the rest of your cloud environment is built on.

Each cloud provider has its own version of IAM. In this blog post we will specifically talk about IAM and related concepts in the context of Amazon Web Services (AWS). We will walk through a scenario where working with IAM through Terraform can lead to potential issues in your environment, and show how you can address these types of issues. Finally, we'll discuss some general best practices around IAM.

For a broader look at AWS best practices, including DNS, check out our next article on DNS: A Deep Dive in AWS Resources & Best Practices to Adopt.

What is Identity and Access Management on AWS?

When you interact with AWS, you first have to authenticate to the platform using your IAM credentials. IAM is the first barrier to entry to AWS.

On AWS you have a few different types of entities, or principals. A principal is an entity that you can assign permissions to allow it to perform actions in your AWS environment.

The three most common types of entities are users, groups and roles. A user is exactly what it sounds like. Your employees could have their own user on AWS.

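You can collect users into groups for easier administration. This allows you to assign permissions once to many users who should have the same level of access. Note that a group is a management convenience: unlike users and roles, a group cannot be named as the principal in a policy.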
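As a minimal Terraform sketch of that idea, the following creates a group, attaches the AWS managed ReadOnlyAccess policy to it, and adds two users. The group name and the user names are hypothetical, and the users are assumed to exist already.

```hcl
# Hypothetical group for engineers who only need read access.
resource "aws_iam_group" "readers" {
  name = "platform-readers"
}

# Attach the AWS managed ReadOnlyAccess policy to the group once,
# instead of attaching it to every user individually.
resource "aws_iam_group_policy_attachment" "readers_read_only" {
  group      = aws_iam_group.readers.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

# The user names are placeholders for IAM users that already exist.
resource "aws_iam_group_membership" "readers" {
  name  = "platform-readers-membership"
  group = aws_iam_group.readers.name
  users = ["alice", "bob"]
}
```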

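A role is an identity that is assumed rather than logged in to, and whoever assumes it receives temporary credentials. Roles are most often used by machines: when you build applications or use other AWS services you assign them roles, and to these roles you assign permissions for what they are allowed to do in your AWS environment. Human users can assume roles as well.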

Permissions on AWS come in the form of policies. A policy is a collection of one or more specific permissions that says what a user or role is allowed to do. There are two major types of policies:

Policies that you attach to entities (users, groups, roles).
Policies that you attach to resources.

The second type of policy is supported for a number of resources, including S3 buckets, SNS topics and API Gateways.

You can use both types of policies together. If you are working on applications on AWS that span multiple AWS accounts you are required to use resource policies in many cases.
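To make the difference between the two policy types concrete, here is a small Terraform sketch that grants the same read access twice: once as an identity-based policy on a role, and once as a resource-based policy on an S3 bucket. The role name, bucket name and policy name are hypothetical, and both the role and the bucket are assumed to exist.

```hcl
# Hypothetical existing role and bucket, looked up by name.
data "aws_iam_role" "reader" {
  name = "report-reader"
}

data "aws_s3_bucket" "reports" {
  bucket = "example-reports-bucket"
}

# 1. Identity-based policy: attached to the role, so it names no principal.
data "aws_iam_policy_document" "identity" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["${data.aws_s3_bucket.reports.arn}/*"]
  }
}

resource "aws_iam_role_policy" "reader" {
  name   = "read-reports"
  role   = data.aws_iam_role.reader.name
  policy = data.aws_iam_policy_document.identity.json
}

# 2. Resource-based policy: attached to the bucket, so it must name
#    the principal that is being granted access.
data "aws_iam_policy_document" "resource" {
  statement {
    effect    = "Allow"
    actions   = ["s3:GetObject"]
    resources = ["${data.aws_s3_bucket.reports.arn}/*"]

    principals {
      type        = "AWS"
      identifiers = [data.aws_iam_role.reader.arn]
    }
  }
}

resource "aws_s3_bucket_policy" "reports" {
  bucket = data.aws_s3_bucket.reports.id
  policy = data.aws_iam_policy_document.resource.json
}
```

Within a single account either policy is usually enough on its own. For cross-account access you generally need both sides to allow the request.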

The content of a policy consists of one or more policy statements. Each statement is one or more permissions that are either allowed or denied, for a given resource or resources.

An example of a policy that you assign to an entity is the AdministratorAccess policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "*",
            "Resource": "*"
        }
    ]
}

This policy consists of a single statement that allows all actions on all resources. Be careful about which principals you assign it to.

How to break your infrastructure through IAM

Given the importance of IAM on AWS it is clear that breaking an IAM resource can have far-reaching consequences. If your IAM roles, users, or policies change unexpectedly, it will impact your infrastructure and your applications.

In this section we will go through a scenario for how your infrastructure can break through changes in IAM resources.

You are a platform engineer at an organization that runs a microservices architecture on AWS. You work in the central platform team, and one of your responsibilities is to manage the shared AWS API Gateway resource that is the entrypoint to parts of your microservices architecture. The API Gateway handles internal and external traffic.

A part of managing the API Gateway resource is to handle access control to the API using resource policies. A resource policy is an IAM policy where you can allow or deny access to the API and its methods.

The current resource policy looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/development-team"
      },
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:eu-west-1:123456789012:p61xuhn9cd"
    }
  ]
}

In essence, the policy allows one of your development teams to invoke all methods of the API resource. The development team has a role aptly named development-team. This role exists in an AWS account with ID 123456789012.

Your organization is using Terraform to set up all your infrastructure. You keep all infrastructure code in a git repository.

The current Terraform configuration for the API Gateway and its resource policy uses the following data source as a reference to the IAM role of the development team:

data "aws_iam_role" "development_team" {
  name = "development-team"
}

The development team is working on a new version of their application and infrastructure. The team is not aware of how their infrastructure is referenced in other parts of the organization, but they assume they should be able to update their own infrastructure without encountering any issues.

As part of their new infrastructure setup they will change the name of their IAM role from development-team to development-team-a to better reflect their team name. They will also delete the old role named development-team.

The development team performs these changes without first checking with you and your colleagues in the platform team.

Shortly after the development team has performed their changes you notice a sudden increase in "403 Forbidden" responses in the API Gateway.

After some troubleshooting you discover that there is a new role named development-team-a that is trying to use the API. You contact the team, which informs you of the changes it has made, and you explain the issue you are currently seeing.

The development team quickly rolls the change back: it recreates the role under its original name, development-team.

You assume that the issue has been resolved since the original IAM role is now back. However, you notice that the rate of 403 responses stays steady even after the development team rolled back their change.

You wonder what is going on, and decide to inspect API Gateway's resource policy. You discover the following:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "AROAZE64MWFAASURXM4E6"
      },
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:eu-west-1:123456789012:p61xuhn9cd"
    }
  ]
}

The policy does not look at all like what you expected. Instead of listing the IAM role in the resource policy there is now a random ID AROAZE64MWFAASURXM4E6. Where does this ID come from?

You look at the Terraform configuration for the API Gateway resource policy and see that it looks correct:

data "aws_iam_role" "development_team" {
  name = "development-team"
}

data "aws_iam_policy_document" "test" {
  statement {
    effect = "Allow"
    principals {
      type = "AWS"
      identifiers = [
        data.aws_iam_role.development_team.arn,
      ]
    }
    # parts of the code left out for brevity
  }
}

The aws_iam_policy_document is correctly referencing the aws_iam_role data source. The data source is correctly referencing a role named development-team.

In a desperate attempt at fixing the issue you apply the Terraform configuration again to bring the infrastructure back to the desired state. The output from terraform plan seems to indicate that the resource policy will be changed back to a working state:

Terraform will perform the following actions:

  # aws_api_gateway_rest_api_policy.test will be updated in-place
  ~ resource "aws_api_gateway_rest_api_policy" "test" {
        id          = "p61xuhn9cd"
      ~ policy      = jsonencode(
          ~ {
              ~ Statement = [
                  ~ {
                      ~ Principal = {
                          ~ AWS = "AROAZE64MWFAASURXM4E6" -> "arn:aws:iam::123456789012:role/development-team"
                        }
                        # (3 unchanged attributes hidden)
                    },
                ]
                # (1 unchanged attribute hidden)
            }
        )
        # (1 unchanged attribute hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy

Specifically, the random ID AROAZE64MWFAASURXM4E6 will be replaced by arn:aws:iam::123456789012:role/development-team as you expect.

You run terraform apply and hope that the issue will be resolved.

The change is applied and Terraform reports success. However, the 403 errors stay at the same alarming rate. Terraform has not updated the resource policy!

You panic.

You run terraform plan again to see what the current state of your infrastructure is. You discover that the plan looks exactly like the previous plan, as if the apply never happened. You have discovered what appears to be a bug in the AWS provider, and you are not sure how to proceed.

After an emergency meeting with the development team you decide to let the development team update their IAM role name to the new name of development-team-a, and you update your resource policy to accommodate these changes:

data "aws_iam_role" "development_team" {
  # you update the name in this data source
  name = "development-team-a"
}

data "aws_iam_policy_document" "test" {
  statement {
    effect = "Allow"
    principals {
      type = "AWS"
      identifiers = [
        # the reference stays the same
        data.aws_iam_role.development_team.arn,
      ]
    }
    # parts of the code left out for brevity
  }
}

The output from terraform plan once again indicates that the desired change will be performed:

Terraform will perform the following actions:

  # aws_api_gateway_rest_api_policy.test will be updated in-place
  ~ resource "aws_api_gateway_rest_api_policy" "test" {
        id          = "p61xuhn9cd"
      ~ policy      = jsonencode(
          ~ {
              ~ Statement = [
                  ~ {
                      ~ Principal = {
                          ~ AWS = "AROAZE64MWFAASURXM4E6" -> "arn:aws:iam::123456789012:role/development-team-a"
                        }
                        # (3 unchanged attributes hidden)
                    },
                ]
                # (1 unchanged attribute hidden)
            }
        )
        # (1 unchanged attribute hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy

The ID AROAZE64MWFAASURXM4E6 will now be replaced by arn:aws:iam::123456789012:role/development-team-a. You are not hopeful, but you apply the change anyway.

The errors disappear! It seems like this time Terraform performed the change.

You notice that the API Gateway resource policy looks better this time:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/development-team-a"
      },
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:eu-west-1:123456789012:p61xuhn9cd"
    }
  ]
}

You and the rest of the platform engineering team set up a post mortem meeting together with the development team to discuss how you can avoid issues like this in the future.

The scenario described above can, and has, happened.

Each principal in AWS has a unique principal ID. An example of such an ID is AROAZE64MWFAASURXM4E6; the IDs of IAM roles start with AROA.
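The aws_iam_role data source in the AWS provider exposes the principal ID as unique_id. As a sketch, assuming Terraform 1.2 or later, a postcondition can pin the ID that the consuming configuration expects, so that a plan fails loudly when a role with the right name turns out to be a different principal. The variable name and the output name are hypothetical.

```hcl
# Hypothetical input: the principal ID that was recorded when the
# dependency on the role was first set up.
variable "expected_role_unique_id" {
  type = string
}

data "aws_iam_role" "development_team" {
  name = "development-team"

  lifecycle {
    # Fails the plan if a role with this name exists but is not the
    # same underlying principal as before.
    postcondition {
      condition     = self.unique_id == var.expected_role_unique_id
      error_message = "The role development-team has a different principal ID than expected; it may have been recreated."
    }
  }
}

# Expose the current ID so it is visible in plan and apply output.
output "development_team_role_unique_id" {
  value = data.aws_iam_role.development_team.unique_id
}
```

This does not repair anything by itself. It only turns a silent mismatch into an explicit error at plan time.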

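Destroying an IAM role and creating a new role with the same name might seem like a safe thing to do, but it is not. The new IAM role has a new principal ID that does not match the original IAM role. When you save a resource policy that names a role by its ARN, AWS typically resolves the ARN to the role's principal ID behind the scenes. If the role is later deleted, the ID can no longer be mapped back to an ARN and the policy shows the raw ID instead, which is what you saw in the scenario. There are a few places where this distinction is important, and one of them is in the API Gateway resource policy.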
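On the side of the team that owns the role, one hedge against accidental recreation is Terraform's prevent_destroy lifecycle argument. In the AWS provider, changing the name of an aws_iam_role forces the role to be replaced, so a plan that includes such a rename is rejected while the guard is in place. The trust policy below is a placeholder assumption and not the development team's real configuration.

```hcl
# Placeholder trust policy: who may assume the role is an assumption here.
data "aws_iam_policy_document" "assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "development_team" {
  name               = "development-team"
  assume_role_policy = data.aws_iam_policy_document.assume.json

  lifecycle {
    # Any plan that would destroy (and therefore recreate) this role
    # fails until the guard is removed deliberately.
    prevent_destroy = true
  }
}
```

The guard only covers changes made through this Terraform configuration. It does not stop someone from deleting the role in the console or through another tool.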

What could the platform and development teams have done to avoid this issue? There are a number of things that could have been done:

Knowledge: knowing about principal IDs and how resource policies behave could have prompted the development team to think one step further before applying their changes. Doing this successfully requires skill, communication and a bit of luck.
Rearchitect your Terraform configurations: currently the platform team uses data sources to read data (e.g. IAM role ARN) from development teams. This is a hidden dependency between different Terraform configurations, and it requires certain attributes of resources to stay the same. You could bring related parts into a single Terraform configuration and make sure they are always updated together. This might not be feasible for large distributed pieces of infrastructure.
Discover dependencies: if you have a way to discover dependencies between your Terraform configurations, you can automate that discovery and flag any issues that a change in one Terraform configuration could cause in another. This process could be included in your CI/CD systems, pull requests, and more.
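Short of merging configurations, one way to make the hidden dependency explicit can be sketched as follows: the development team publishes the role ARN as an output, and the platform team reads that output from the team's remote state on S3 instead of looking the role up by name. The output name, bucket name and state key are hypothetical.

```hcl
# The development team's configuration is assumed to declare:
#
#   output "api_caller_role_arn" {
#     value = aws_iam_role.development_team.arn
#   }

# Platform team's configuration: read the published output from the
# development team's state. Bucket and key are hypothetical.
data "terraform_remote_state" "development_team" {
  backend = "s3"

  config = {
    bucket = "example-terraform-state"
    key    = "development-team/terraform.tfstate"
    region = "eu-west-1"
  }
}

data "aws_iam_policy_document" "test" {
  statement {
    effect = "Allow"

    principals {
      type = "AWS"
      identifiers = [
        data.terraform_remote_state.development_team.outputs.api_caller_role_arn,
      ]
    }
    # remaining arguments omitted, as in the configuration above
  }
}
```

The output now acts as a small contract between the teams: if it is renamed or removed, the platform team's plan fails with an explicit error. It does not by itself prevent the stale principal ID problem, but it makes the dependency visible in code that both teams can search for.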

Of the three options discussed above, the third is the most broadly applicable. It does not require you to refactor all of your Terraform configurations into one large mono-infrastructure.

To implement a solution for the third option described above, you can use Anyshift. Anyshift allows you to discover cloud dependencies using AI-driven insights. This provides you with a simpler way to catch issues where updating one part of your infrastructure breaks another part of your infrastructure.

To get started with Anyshift, configure integrations with your GitHub environment, your AWS account and your Terraform remote state storage on Amazon S3.

Once the integrations are set up, generate the Anyshift graph. This is an AWS knowledge graph of your environment, including how resources on AWS map to source code on GitHub.

With the integrations set up, open a pull request to update the IAM role name described in the scenario above. Anyshift analyzes the proposed change and presents an impact analysis.

Further down in the impact analysis you can read an analysis of the change in natural language and get insights that you can act on.

Anyshift works like a sophisticated form of Terraform drift detection that acts before drift is introduced.

Visit the documentation to learn more about Anyshift.

Best practices for AWS IAM

There are a number of best practices around AWS IAM that you should implement.

Use the principle of least privilege

When writing IAM policies for your entities, use the principle of least privilege. This means that you should assign the permissions that an entity needs to perform its job, but no more. A role that only needs to read objects in an S3 bucket does not need to have full administrator access.

You can restrict policies both in terms of the permissions you assign to an entity, and in terms of which resources the permissions apply to. You can also take it one step further to use conditions for when the permissions are valid.
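As an illustrative sketch, the following policy document narrows access along all three axes: a single action, a single prefix in a single bucket, and a condition that must hold. The bucket name, prefix and policy name are hypothetical.

```hcl
data "aws_iam_policy_document" "read_reports" {
  statement {
    effect = "Allow"

    # Only the one action the role needs...
    actions = ["s3:GetObject"]

    # ...on objects under one prefix of one (hypothetical) bucket...
    resources = ["arn:aws:s3:::example-reports-bucket/reports/*"]

    # ...and only when the request is made over TLS.
    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["true"]
    }
  }
}

resource "aws_iam_policy" "read_reports" {
  name   = "read-reports"
  policy = data.aws_iam_policy_document.read_reports.json
}
```

Compare this with the AdministratorAccess policy shown earlier, where both the action and the resource are a wildcard.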

Use multi-factor authentication

Use multi-factor authentication (MFA) for all your human user accounts on AWS. The benefit is that even if the user's password is leaked, it will not be enough to get access to the account.
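One common way to enforce this for IAM users is a policy that denies almost everything until the user has authenticated with MFA. The sketch below is modeled on the pattern AWS documents for this purpose; the policy name is hypothetical, and the list of exempted actions is a starting point that you would adjust to your own needs.

```hcl
data "aws_iam_policy_document" "require_mfa" {
  statement {
    sid    = "DenyAllExceptMfaSetupWithoutMfa"
    effect = "Deny"

    # Everything is denied except the calls a user needs to register
    # an MFA device and obtain an MFA-backed session.
    not_actions = [
      "iam:CreateVirtualMFADevice",
      "iam:EnableMFADevice",
      "iam:GetUser",
      "iam:ListMFADevices",
      "iam:ListVirtualMFADevices",
      "iam:ResyncMFADevice",
      "sts:GetSessionToken",
    ]
    resources = ["*"]

    # BoolIfExists also matches requests where the key is absent,
    # such as requests signed with long-term access keys.
    condition {
      test     = "BoolIfExists"
      variable = "aws:MultiFactorAuthPresent"
      values   = ["false"]
    }
  }
}

resource "aws_iam_policy" "require_mfa" {
  name   = "require-mfa"
  policy = data.aws_iam_policy_document.require_mfa.json
}
```

Attaching this policy to a group that contains all human users applies the rule to everyone at once.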

Secure the AWS root account

The AWS root account should only be used for initial setup of the AWS account. As with any other user on AWS, enable MFA for it. You could use a physical MFA device (e.g. a YubiKey) that you store in a secure location.

Use roles instead of users

If possible, use IAM roles instead of IAM users. Roles use temporary security credentials by default, which reduces the risk of leaked credentials. You can assign roles to your applications and services running on AWS and assign them the permissions they need.
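For example, a Lambda function does not need an IAM user with access keys. A minimal sketch, with a hypothetical role name, gives it a role that only the Lambda service can assume:

```hcl
# Trust policy: only the Lambda service may assume this role.
data "aws_iam_policy_document" "lambda_assume" {
  statement {
    effect  = "Allow"
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "report_function" {
  name               = "report-function"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}

# Permissions: the AWS managed policy that lets the function write logs.
resource "aws_iam_role_policy_attachment" "logs" {
  role       = aws_iam_role.report_function.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}
```

The function receives short-lived credentials for this role at runtime, so there is no long-term secret to store or rotate.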

For your normal users, you can enable SSO sign-in and connect signed-in users to roles instead of IAM user accounts.

Audit IAM activity

You should enable CloudTrail logs for your account and set up alerts for unusual events. You should alert for any activity performed by the root user account to make sure you are aware when this account is used.
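A possible starting point for the root user alert is an EventBridge rule that matches root activity recorded by CloudTrail and forwards it to an SNS topic. This is a sketch with hypothetical names. It assumes CloudTrail is enabled, and it leaves out the SNS topic policy that allows EventBridge to publish as well as the topic subscriptions. Depending on how your users sign in, console sign-in events may only be delivered in a specific region.

```hcl
# Hypothetical SNS topic that the security team subscribes to.
resource "aws_sns_topic" "security_alerts" {
  name = "security-alerts"
}

# Match API calls and console sign-ins made by the root user,
# as recorded by CloudTrail and delivered through EventBridge.
resource "aws_cloudwatch_event_rule" "root_activity" {
  name = "root-user-activity"

  event_pattern = jsonencode({
    "detail-type" = [
      "AWS API Call via CloudTrail",
      "AWS Console Sign In via CloudTrail",
    ]
    detail = {
      userIdentity = {
        type = ["Root"]
      }
    }
  })
}

# Forward every matching event to the SNS topic.
resource "aws_cloudwatch_event_target" "root_activity_to_sns" {
  rule = aws_cloudwatch_event_rule.root_activity.name
  arn  = aws_sns_topic.security_alerts.arn
}
```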

Educate your organization on security best practices

IAM security is paramount for your cloud environment. You should educate your organization on how to use IAM and to follow best practices.

Shift security left

A general best practice is to shift security left. Use IAM policies, service control policies (SCPs) and governance tools such as AWS IAM Access Analyzer to secure your environment. Perform a security analysis of each change you introduce in your environment. Integrate security features into your complete AWS environment and beyond.
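As one hedged example of an SCP managed through Terraform, the sketch below denies tampering with CloudTrail in every account under an organizational unit. It assumes that you use AWS Organizations with SCPs enabled and that Terraform runs with credentials for the management account. The variable and the policy name are hypothetical.

```hcl
variable "target_ou_id" {
  type        = string
  description = "Hypothetical ID of the organizational unit to protect."
}

data "aws_iam_policy_document" "protect_cloudtrail" {
  statement {
    sid       = "DenyCloudTrailTampering"
    effect    = "Deny"
    actions   = ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"]
    resources = ["*"]
  }
}

resource "aws_organizations_policy" "protect_cloudtrail" {
  name    = "protect-cloudtrail"
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.protect_cloudtrail.json
}

# Apply the SCP to every account under the chosen organizational unit.
resource "aws_organizations_policy_attachment" "protect_cloudtrail" {
  policy_id = aws_organizations_policy.protect_cloudtrail.id
  target_id = var.target_ou_id
}
```

Because the guardrail lives in code, it goes through the same review and CI/CD process as the rest of your infrastructure.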

Summary

AWS Identity and Access Management (IAM) is a central service on AWS.

Two important concepts in IAM are principals (users, groups, roles) and policies (permissions). Policies control what is allowed or denied in your environment, and these policies control what your principals can do in your AWS accounts.

Given the centrality of IAM it is clear that this is a sensitive part of your infrastructure, where misconfigurations can lead to a complete stop in your production environment.

In this blog post we followed a scenario where parts of a microservices architecture experienced an issue due to a broken policy from an innocent change of an IAM role.

The infrastructure broke because each IAM principal has a principal ID that is unique. Recreating an IAM role with the same name still creates a principal with a new unique ID. An innocent change like that can break references to the role in unexpected ways that are hard to fix.

Changes like the one highlighted in the scenario are common, and it is critical to understand what consequences your infrastructure changes can have.

Anyshift can help you catch these types of mistakes. To get started visit anyshift.io and sign up for a free account.

Articles by

Mattias Fjellström

Accelerate at Iver Sverige

Cloud Architect | Author | HashiCorp Ambassador | HashiCorp User Group Leader

Mattias is a cloud architect consultant working to help customers improve their cloud environments. He has extensive experience with both the AWS and Microsoft Azure platforms and holds professional-level certifications in both.

He is also a HashiCorp Ambassador and an author of a book covering the Terraform Authoring and Operations Professional certification.

Blog: https://mattias.engineer
Linkedin: https://www.linkedin.com/in/mattiasfjellstrom/
Bluesky: https://bsky.app/profile/mattias.engineer
