04 Oct 2022

Infrastructure as code

You've heard it everywhere. You know it's a good thing. Versioning! Replicability! It's all great.

Of course you want to have everything in code somewhere, in files. Now what would you use for that?

[A silly drawing of Superterra]

I think if you use AWS you can do a lot worse than terraform, and this is how I would structure it, based on my very extensive experience managing a big and complex multi-country, multi-environment setup.

Let's start

You start reading about terraform and you will end up with a bunch of files that define your environment. Maybe you will import third-party modules, maybe you will create your own, but most likely you will end up with one big folder where all the files depend on each other.

This will start biting you in the butt:

  • Applying will take longer and longer, as terraform has to reconcile every resource in the local state with what actually exists in AWS
  • All your users (admins) will be able to do everything, and even though terraform is quite good at letting multiple users share the same state backend, sometimes whatever you are working on will be removed (or flagged for removal) by another user's apply
  • Terraform version changes mean a lot of churn, and you will be hesitant to apply them because they have to be applied system-wide

There are other approaches I won't go into here, like terragrunt.

First step: take control of terraform itself

Install tfenv.

Create a .terraform-version file and a versions.tf file:

.terraform-version:

1.3.1

versions.tf:

terraform {
  required_version = "~> 1.3"
}

The first file is used by tfenv to detect and install the correct version. You can use tfenv use to switch to it (and tfenv install to install it).

The second file makes terraform itself refuse to run if the installed version doesn't satisfy the requirement, so nobody accidentally applies your configuration (and upgrades its state format) with an incompatible version.
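
A minimal session, assuming tfenv is installed and you run it from the folder containing .terraform-version (the last line is what terraform itself prints):

$ tfenv install      # with no argument, installs the version pinned in .terraform-version
$ tfenv use 1.3.1    # recent tfenv versions also accept a bare 'tfenv use'
$ terraform version
Terraform v1.3.1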

How best to split resources in terraform files?

Let's abandon the idea of having it all in one state. Accept defeat and split it into smaller, independent states that can be applied separately. This is how:

Big caveat: this is just my proposal, based on what has been working for my needs and the needs of my company and colleagues so far. Maybe we will change it tomorrow, but here it is.

  • Create a folder for your modules (shared code)
  • Create a folder per each environment + region combination (staging-us, prod-us, prod-eu, ...)
  • Create inside each:
    • a folder per 'business area' or 'team' that is likely to need to work in parallel and is relatively independent
    • a folder for 'staff', where you will define all the auth data/groups etc. This could live in a completely separate place altogether if you don't want to leak user names and who can do what
  • Also create a file with shared values (shared.tf)
  • Create provider.tf, variables.tf, terraform.tfvars and versions.tf in each of the areas; most of this you will only write once, during the initial terraform setup

Now in each of these area folders:

  • Create symbolic links to .terraform-version, provider.tf, variables.tf, terraform.tfvars, versions.tf and shared.tf in each of the areas(*)
  • Create a backend.tf file in which you declare that the backend for terraform's state will be S3, in a bucket, under a key that includes the name of the area
    • Now you have a way to prevent teams from accessing each other's state
    • You can of course use any other backend, the core idea here is to use something you can restrict access to.

(*) This is not strictly necessary, but it will benefit you in the long term: it guarantees you use the same version of terraform everywhere and share a common set of 'global variables'. Using the same version becomes extremely important where modules are concerned, as they are shared code.
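
Creating the links is quick; a sketch, assuming the layout shown further down and an area called frontend:

$ cd prod-us/terraform/frontend
$ ln -s ../.terraform-version ../versions.tf ../terraform.tfvars \
        ../variables.tf ../provider.tf ../shared.tf .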

Now here's how I go about gluing everything together, and believe me we've tried many approaches:

  • First: minimize dependencies between teams as much as you can
  • Where these are inevitable, like for instance the list of VPCs and subnets, you define them in the shared resources or variables

An example backend.tf would look like:

terraform {
  backend "s3" {
    bucket  = "super-important-status"
    key     = "prod-us-1/terraform/authservice"
    profile = "my aws profile name for this region"
    region  = "us-east-1"
  }
}

An example shared.tf:

locals {
  // These values are retrieved from the main module using 'terraform state show' and related commands
  prod_route53_zone_id = "ZXXYYXXYYXXYZZ"
  private_subnet_ids = [
    "subnet-0aaaaaaaaaaaaaaaa",
    "subnet-0bbbbbbbbbbbbbbbb",
    "subnet-0cccccccccccccccc",
    "subnet-0dddddddddddddddd",
    "subnet-0eeeeeeeeeeeeeeee",
  ]
  public_subnet_ids = [
    "subnet-0aaaaaaaaaaaaaaaa",
    "subnet-0bbbbbbbbbbbbbbbb",
    "subnet-0cccccccccccccccc",
    "subnet-0dddddddddddddddd",
    "subnet-0eeeeeeeeeeeeeeee",
  ]
  main_vpc_id   = "vpc-aaaaaaaaaaaaaa"
  main_vpc_cidr = ["172.29.0.0/16"]
  vpn_cidr      = ["172.27.0.0/22"]
  vpn_sg        = ["sg-xxxxxxxxxxxxxxxx"]
}
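
As for getting those values in the first place, here is a sketch: the 'network' area and the aws_vpc.main resource address are made-up names, but the command is standard terraform:

$ cd prod-us/terraform/network
$ terraform state show aws_vpc.main
# aws_vpc.main:
resource "aws_vpc" "main" {
    cidr_block = "172.29.0.0/16"
    id         = "vpc-aaaaaaaaaaaaaa"
    ...
}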

Note that having to do this is ugly, but realistically the number of times you are going to recreate your VPCs once you have a production (or even dev) setup going is about zero. And if you ever do, all you'll have to do is update your shared values and reapply all areas.
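
Since shared.tf is symlinked into every area, any resource can use these locals directly. A sketch with a hypothetical security group (the resource and its name are invented; the locals come from the shared.tf above):

resource "aws_security_group" "authservice" {
  name        = "authservice"
  description = "Allow HTTPS from the VPN only"
  vpc_id      = local.main_vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = local.vpn_cidr
  }
}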

This leaves us with something like:

[infrastructure]
  |
  -- [modules]
  |  |
  |  -- [mysharedcode]
  |  |  |
  |  |  +- variables.tf
  |  |  +- outputs.tf
  |  |  \- main.tf
  |  |
  |  -- [myothersharedcode]
  |     |
  |     +- variables.tf
  |     +- outputs.tf
  |     \- main.tf
  |
  -- [prod-us]
     |
     -- [terraform]
        |
        +- .terraform-version
        +- versions.tf
        +- terraform.tfvars
        +- variables.tf
        +- provider.tf
        +- shared.tf
        |
        -- [staff]
        |  |
        |  +- .terraform-version --> ../.terraform-version
        |  +- versions.tf --> ../versions.tf
        |  +- terraform.tfvars --> ../terraform.tfvars
        |  +- variables.tf --> ../variables.tf
        |  +- provider.tf --> ../provider.tf
        |  +- shared.tf --> ../shared.tf
        |  +- users.tf
        |  +- permissions.tf
        |  \- backend.tf
        |
        -- [frontend]
        |  |
        |  +- .terraform-version --> ../.terraform-version
        |  +- versions.tf --> ../versions.tf
        |  +- terraform.tfvars --> ../terraform.tfvars
        |  +- variables.tf --> ../variables.tf
        |  +- provider.tf --> ../provider.tf
        |  +- shared.tf --> ../shared.tf
        |  +- s3.tf
        |  +- iam.tf
        |  \- backend.tf
        |
        -- [dataeng]
           |
           +- .terraform-version --> ../.terraform-version
           +- versions.tf --> ../versions.tf
           +- terraform.tfvars --> ../terraform.tfvars
           +- variables.tf --> ../variables.tf
           +- provider.tf --> ../provider.tf
           +- shared.tf --> ../shared.tf
           +- s3.tf
           +- other.tf
           \- backend.tf

In the particular case outlined above:

  • Terraform will create a key prefix 'prod-us-1/terraform/dataeng' inside the super-important-status bucket (matching the key in that area's backend.tf)
  • Inside, it will keep terraform's state: a mapping between resource IDs in AWS and resource addresses in terraform
  • It will use a profile you will have defined in your ~/.aws/credentials file, as sketched below
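
For reference, a minimal ~/.aws/credentials entry; the profile name ('prod-us' here) and the keys are placeholders, and the name must match the profile argument in the area's backend.tf:

[prod-us]
aws_access_key_id     = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx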

Final thoughts

This setup's main aims are to:

  • Help with auditing, scope limiting and user creation
  • Allow teams to work in parallel, tolerating 'half-applied' terraform objects while one team is deploying a change or simply experimenting
  • Protect developers and admins from themselves

All without adding layers that would make infrastructure-as-code become 'infrastructure-as-code that-you-modify-with-some-tool'.

To reap the full benefits, you now create users and permission policies in your staff folder, then add your users to groups or attach policies accordingly. The policies will grant access to only part of the S3 state (e.g. allowing mybucket/mystatusprefix/terraform/dataeng/*).
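
A sketch of such a policy, written in terraform itself; the bucket and key prefix come from the examples above, while the policy name and the choice to allow ListBucket on the whole bucket are my assumptions (and if you use DynamoDB state locking, you will need to grant access to the lock table too):

resource "aws_iam_policy" "dataeng_state" {
  name = "terraform-state-dataeng"
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Listing the bucket is needed so terraform can find its state object
        Effect   = "Allow"
        Action   = ["s3:ListBucket"]
        Resource = "arn:aws:s3:::super-important-status"
      },
      {
        # Read/write access restricted to this area's key prefix only
        Effect   = "Allow"
        Action   = ["s3:GetObject", "s3:PutObject"]
        Resource = "arn:aws:s3:::super-important-status/prod-us-1/terraform/dataeng/*"
      },
    ]
  })
}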

Fair warning: however much you try, the users able to apply terraform will probably end up with too many rights. It is extremely difficult to narrow permissions down. The only way is to start with nothing, try applying, add the missing permissions, and repeat. From time to time your users will need to do something that isn't allowed and they'll see the rejection; just make it a process where they ask you/the admins to review.

