Real Kinetic Blog

Our thoughts, opinions, and insights into technology and leadership. We blog about scalability, devops, and organizational issues.


Terraform & Infrastructure as Code best practices


Building and maintaining infrastructure, especially in the cloud, is becoming more and more complex. Infrastructure as Code (IaC) has become an essential part of managing that complexity. We at Real Kinetic have worked with many teams to help implement and maintain large deployments across AWS and GCP.

Terraform

Both AWS and GCP come with their own flavors of IaC — CloudFormation and Cloud Deployment Manager, respectively. Both have their pros and cons, but we have found that HashiCorp’s Terraform is the simplest, best documented, and most widely supported. Many of our clients find Terraform to be the best option.

Repository structure

When maintaining infrastructure through Terraform, we recommend a two-repo structure. The first is the modules repo. This is where the blueprints of the infrastructure are stored. This is a shared repo to which product and operations teams contribute their infrastructure definitions. The standard SDLC applies to this repo: pull requests, code review, tagging, and releasing.

The second is called the live repo. This repo references the code stored in the modules repo and stores the variables used to build the infrastructure for each environment.

An example of what these repositories might look like:

modules

    networking/
        main.tf
        vars.tf
        output.tf
    kubernetes/
        main.tf
        vars.tf
        output.tf
    data-stores/
        sql/
            main.tf
            vars.tf
            output.tf
    services/
        my-app/
            main.tf
            vars.tf
            output.tf

live

    prod/
        networking.tfvars
        kubernetes.tfvars
        data-stores/
            sql.tfvars
    pre-prod/
        networking.tfvars
    dev/
        networking.tfvars

Each Terraform module defined in the modules repository is a referenceable, reusable resource. We have seen companies use a single modules repo or split them out so that each team can control an aspect of the infrastructure, e.g. networking vs. data team vs. application team.
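Referencing a module from the shared repo typically looks like the following. This is a hedged sketch: the repository URL, module name, tag, and variable names are illustrative, but the `git::` source syntax with a `ref` pinned to a released tag is the standard Terraform mechanism.

```hcl
# Hypothetical example: consuming the networking module from the shared
# modules repo, pinned to a released tag (URL and tag are illustrative).
module "networking" {
  source = "git::https://github.com/example-org/modules.git//networking?ref=v1.2.0"

  # Input variables declared in networking/vars.tf (names illustrative)
  project    = var.project
  region     = var.region
  cidr_block = "10.0.0.0/16"
}
```

Pinning to a tag rather than a branch means a workspace only picks up module changes when its reference is deliberately bumped, which keeps environments reproducible.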

In the live repo, each workspace (or environment in legacy terminology) is defined with reference to the set of modules that make up the environment. Typically, dev/pre-prod and prod are copy-and-paste equivalents. Note that the live repo does not contain any *.tf files, just *.tfvars that contain references to the required modules and defined configuration variables to build and deploy the infrastructure for that workspace.
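A tfvars file in the live repo is then just the per-workspace values for the variables a module declares. A minimal sketch, with illustrative names and values:

```hcl
# live/prod/networking.tfvars -- illustrative values only; the variable
# names must match those declared in the networking module's vars.tf.
project    = "example-prod"
region     = "us-central1"
cidr_block = "10.10.0.0/16"
```

Because prod and pre-prod differ only in these values, diffing the two directories shows exactly how the environments diverge.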

Terraform state

Each workspace requires an independent state. This state should persist between plan/apply cycles, as it represents the known configuration of the infrastructure the last time Terraform was run. Also note that this state can contain sensitive information, such as service keys and database passwords, so it should be encrypted at rest and in transit. We typically recommend using GCS/S3 to maintain these state files. For pre-prod/prod environments, we recommend that only CI systems have access to the state files, for auditing and compliance purposes.
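Remote state on GCS can be configured with a backend block along these lines. The bucket name and prefix are illustrative; GCS encrypts objects at rest by default, and bucket IAM should restrict access to the CI system for pre-prod/prod.

```hcl
# Hypothetical remote state configuration for one workspace's networking
# stack. The bucket name and prefix are illustrative.
terraform {
  backend "gcs" {
    bucket = "example-terraform-state"
    prefix = "prod/networking"
  }
}
```

The S3 backend is configured analogously, with the addition of a DynamoDB table if state locking is required.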

Infrastructure change best practices

Changing infrastructure without impacting the people using the system can be a complex problem. Most of the advice given here is meant to help facilitate rolling the infrastructure back to a known good state if an issue is detected.

We have found that clients who have had the most success with Terraform (or any IaC) follow these best practices:

Immutable infrastructure

Immutable infrastructure is the practice of creating new infrastructure instead of changing in place. An excellent write-up on the differences and trade-offs of immutable infrastructure is available here.

One common question is how to handle stateful workloads. Typically, we suggest that the data/state is externalized as much as possible (e.g. avoid writing to local disk) and to use managed services to store data such as Cloud SQL or Datastore.

Minimize the blast radius

Large changesets inherently carry a higher risk of failure, and this is especially true when dealing with infrastructure. Ensure that each proposed change is small. Need to add additional subnets to a specific VPC? That is one changeset. Make a VPN connection from on-prem? That is another, separate changeset. The idea is to minimize the blast radius: what can be impacted if the change does not apply correctly. A clear understanding of the impact on existing infrastructure if the change fails is critical to avoiding a potential production outage.

Schema changes

When changing infrastructure, we typically recommend an approach similar to database schema migrations. Additive changes are more straightforward and have the least impact if they need to be rolled back. Changes that require complex modifications, such as changing column types or renaming columns, are riskier, and rollback precautions must be scrutinized even further.

The same patterns apply to infrastructure. Additions are best because changing existing infrastructure carries more risk.

Scenario: Migrating an application from MySQL to MongoDB

Let’s say that an application team needs to move from a traditional RDBMS to a NoSQL database. Here are some of the individual changesets (separate PRs) to the infrastructure to achieve this:

  1. Create the infrastructure required to run the new database. Note that at this point, the application is not using this database at all. This is an additive step, and rollback should be straightforward.
  2. Deploy a version of the application that writes to both the old and new databases. This step might be a separate process to follow the commit log of the old database. Note that the application is ONLY reading from the old database.
  3. Deploy a version of the application that writes to both but reads from the NEW database.
  4. Deploy a version of the application that only writes to the NEW database.
  5. Delete the old database instance. This step assumes that no other tenants are using the database instance.

Each of these steps can be completed individually. The order does matter, but you can manage the risk of each change to the infrastructure (and therefore your customers) much more carefully.

This approach requires a lot of coordination and communication with the application team, but they are the ones with the most context on the implications for potentially affected users at each stage. It also allows changes to be rolled back more safely in the event of an issue. Though coordination is required, the infrastructure changes are decoupled from the application changes, so while the order must be correct, the timing is less important.

Many details have been left out here on exactly how this can be approached, for example how to do an initial migration of the data from one database to the other and how to handle transactional changes. The steps listed above are meant to be an outline of the strategy to employ when making complex, mutating changes to existing infrastructure.
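The dual-write stages of the scenario above (steps 2 and 3) can be sketched at the application level. This is a minimal illustration, not the actual migration code: in-memory dictionaries stand in for the two databases, and the class and flag names are hypothetical.

```python
class DualWriteRepository:
    """Illustrative dual-write repository: every write goes to both the
    old and new stores; reads come from whichever store is primary."""

    def __init__(self, old_store, new_store, read_from_new=False):
        self.old_store = old_store
        self.new_store = new_store
        # Flip this flag (step 3) once the new database is verified.
        self.read_from_new = read_from_new

    def write(self, key, value):
        # Steps 2 and 3: writes land in both databases.
        self.old_store[key] = value
        self.new_store[key] = value

    def read(self, key):
        # Step 2 reads from the old database; step 3 reads from the new one.
        store = self.new_store if self.read_from_new else self.old_store
        return store.get(key)


old_db, new_db = {}, {}
repo = DualWriteRepository(old_db, new_db)
repo.write("user:1", {"name": "Ada"})
value = repo.read("user:1")  # served from the old store at this stage
```

Once reads are flipped and verified, the write path can drop the old store entirely (step 4), at which point the old database instance becomes safe to delete.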

Code review

We recommend that all Terraform changes go through an SDLC process that includes a proper, required code review step. Proposed changes should be reviewed by someone who was not involved in authoring them.

Make sure that terraform plan runs against the prod environment as part of CI. The output of this command gives you a concrete understanding of the changes that will be applied. If this step fails, make sure the CI build fails and the failure is reported on the code review.
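A CI job for this check might look like the following CircleCI sketch. The job name, image tag, and file paths are illustrative; the point is that the plan runs on every pull request and a non-zero exit fails the build.

```yaml
# Hypothetical CircleCI job: run terraform plan against prod on each PR.
# A plan error exits non-zero, which fails the build automatically.
jobs:
  plan-prod:
    docker:
      - image: hashicorp/terraform:1.5  # image tag illustrative
    steps:
      - checkout
      - run: terraform init -input=false
      - run: terraform plan -input=false -var-file=prod/networking.tfvars
```

Posting the plan output back to the pull request gives reviewers the concrete diff to scrutinize.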

Promotion to production

When a change is ready to be promoted, a PR must be made against the master branch in the live repository. This PR must go through some scrutiny looking at:

  1. The impact of the change, including any resizing of infrastructure. Resizing is typically a destroy-and-recreate, which can have a massive impact on the traffic hitting that workspace.
  2. A thorough review and understanding of the terraform plan output.
  3. Rollback steps if the changes fail.
  4. Disaster recovery for the worst-case scenario.
  5. Make sure that the dev teams whose applications are going to be affected are aware of the change so they can express concern (or otherwise). Typically, we find these are the folks who proposed the change in the first place.

Once the change has passed the code review stage, it needs to be merged to master and pushed through the promotion process.

The promotion process typically upgrades the infrastructure of the workspaces in a pre-defined manner. This is typically dev -> pre-prod -> prod. Tools we see being used to achieve this include:

  • Jenkins pipeline
  • CircleCI (and other SaaS products)
  • Spinnaker

Pipeline workflow

The master branch of the live repo should always be deployable to production. We refer to this branch as “golden.” Steps for environment promotion might include:

  1. terraform apply to [environment]. Fail the pipeline if this step errors.
  2. Acceptance tests — typically a check that error rates for critical services have not changed significantly. Usually manual to start with, but can eventually be automated.
  3. Manual confirm to review the plan output and move to the next environment. You can remove this step once you have a high level of confidence in the acceptance tests.
  4. Repeat previous steps for all non-prod environments.
  5. For prod-like environments, we typically see clients add a manual confirm before the terraform apply step to start with as they gain experience and confidence in the system.
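The promotion flow above can be sketched as a CI workflow. This is an illustrative CircleCI-style fragment, not a prescribed configuration; job names are hypothetical, and the manual-confirm steps map to approval gates.

```yaml
# Hypothetical promotion workflow: dev -> pre-prod -> prod, with a manual
# approval gate before prod while confidence in acceptance tests is built.
workflows:
  promote:
    jobs:
      - apply-dev
      - apply-pre-prod:
          requires: [apply-dev]
      - hold-for-prod:
          type: approval        # manual confirm before touching prod
          requires: [apply-pre-prod]
      - apply-prod:
          requires: [hold-for-prod]
```

Equivalent stage-and-gate structures are straightforward to express in Jenkins pipelines or Spinnaker.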

If any of these steps fail, there should be an immediate investigation and remediation, ideally a rollback to the previous known good state of the infrastructure. A revert commit should be applied to master to ensure that it mirrors the state of production.

There should only be one changeset going through the pipeline at any one time to ensure a rollback can be completed successfully without unintended side-effects. The speed at which a changeset can work its way through to production depends on the complexity, and therefore smaller changesets are better suited to this approach.

Conclusion

IaC/Terraform is a powerful tool to enable you and your teams to define and deploy infrastructure in a controllable and maintainable manner. Implementing these best practices can help you to minimize downtime and allow engineers to focus on their primary job — providing business value.

We would love to help you make Terraform a success at your company. Come and talk to us.



Written by Nick Joyce

Cloud herder. Code monkey. Wood worker. Husband. Human. Managing Partner at Real Kinetic.
