It’s Time to Retire Terraform

Published in Real Kinetic Blog, Apr 23, 2024

Terraform exists in many people’s hearts much like a friend or a loved one, or maybe even an enemy. Whether it’s your job to maintain the Terraform configuration for the entire organization or you get the pleasure of having to write some every so often, it’s always there, waiting to ruin your day. To those who enjoy it, you may have Stockholm Syndrome. To those who hate it, I feel your pain.

Now, I am not here to disparage infrastructure as code. I am saying there has to be a better way to do it. In fact, Terraform has done the industry a huge service by bringing IaC to the masses. It surely has the lion’s share of the market, but that’s like saying Van Halen made good music because they were so popular (sorry for the stray). Terraform has served its purpose, but it’s time to retire it. Here are the reasons why, followed by an alternative approach you should consider.

Bespoke patterns

Every Terraform configuration and module is structured in its own way. One team might use symlinks and complex folder structures to manage the promotion of changes across environments (we will get to this next), and another team might choose to shove everything in one file. Variables, outputs, and versions can live in their own files, be spread across files, or be bundled together in a single file. There are some patterns and conventions, but they are more akin to light suggestions. Modules are a whole other beast with their own idiosyncrasies.

Support for multiple environments

I think we can all agree that there are legitimate use cases for managing multiple instances of the same configuration with variants beyond environment variables. Again, it is possible to work around this via a complex directory and branching structure, but teams rarely understand the patterns for managing promotions and releases through their environments. Terraform Enterprise helps somewhat, but it still often feels like a gigantic hack and can be quite confusing to follow. The level of inconvenience out of the box feels almost intentional.

Drift management

I have a problem with the API for supporting drift, which is to say there is none. Hopping into the state file, where secrets are often stored in plain text, and modifying structure or values by hand is not a solid security story. It’s a huge risk, and it isn’t talked about enough. Addressing drift or state inconsistencies often entails killing your entire day to work through complex workflows: remove the resource from the Terraform state, promote through your environments, remove the references from your code, apply the configuration again through each environment, add the resource back into the code, and apply the configuration again to each environment. The inconsistencies between what plan predicts and what apply actually does make this all the more exciting.

HCL

HCL is an interesting “language.” It sits somewhere between a configuration specification format and a programming language. It is pseudo-declarative. You have some forms of flow control, but it isn’t always easy to predict behavior. You can’t write tests for the complex logic that creeps in, and it might not even be obvious that complex logic has in fact crept in. There is generally poor tooling available to support development within the ecosystem. Commands such as “validate” check for basic syntax problems and little more.

It should be way easier to read and write your infrastructure as code: it is critical, has a high blast radius, and is generally harder to roll back. Consider the way loops and indexes work. They are complex, and you’ve got few means of validating their behavior. Maintaining these code bases gets harder still because resource references routinely run to 100 characters: each name has to include the resource type and be unique and descriptive enough to provide context.
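To make the point about loops and indexes concrete, here is a toy Python model (not Terraform’s planner) of the difference between addressing repeated resources by position, as `count` does, and by key, as `for_each` does. The function names and the resource names are hypothetical, and the "change" action is a simplification: whether a real provider updates in place or replaces depends on the attribute involved.

```python
def plan_by_index(old_names, new_names):
    """Model index-addressed resources: address i holds the i-th name."""
    changes = []
    for i in range(max(len(old_names), len(new_names))):
        old = old_names[i] if i < len(old_names) else None
        new = new_names[i] if i < len(new_names) else None
        if old == new:
            continue  # nothing to do at this address
        if old is None:
            changes.append(("create", i, new))
        elif new is None:
            changes.append(("destroy", i, old))
        else:
            changes.append(("change", i, f"{old} -> {new}"))
    return changes


def plan_by_key(old_names, new_names):
    """Model key-addressed resources: each name is its own address."""
    old, new = set(old_names), set(new_names)
    changes = [("destroy", name, name) for name in sorted(old - new)]
    changes += [("create", name, name) for name in sorted(new - old)]
    return changes


before = ["api", "worker", "cron"]
after = ["api", "cron"]  # "worker" removed from the middle of the list

index_plan = plan_by_index(before, after)
key_plan = plan_by_key(before, after)
print(index_plan)
print(key_plan)
```

In this model, dropping the middle entry shifts everything after it: the index-addressed plan touches address 1 and destroys address 2, even though only "worker" went away, while the key-addressed plan destroys "worker" and nothing else. It is exactly the kind of behavior that is hard to predict from reading the configuration alone.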

Over time, you might want to refactor your code. That might happen because you learn more about how to structure your modules. It might happen because your Terraform configuration is growing and evolving. It might happen because you want to make things more modular. There are lots of valid reasons for refactoring a code base, and if you’ve ever refactored a large Terraform configuration, you know exactly how challenging this can be.

Ignoring non-deterministic APIs

The issue here is less that the modules themselves aren’t good and more that the framework doesn’t provide consistent patterns for consuming APIs. What I mean is this: when I am consuming a Terraform module, what are the ways it can fail only at apply time, beyond invalid syntax and terraform plan issues? It is difficult to do due diligence without understanding the apply failure modes. I am not asking for magic beans here that can look into some non-deterministic result and predict what will happen. I just want consistency and documentation provided by the module to tell me the likely failures, and the catastrophic failures, that could happen if certain conditions are not met. If the iteration cycle on a failed apply command wasn’t so frustrating, maybe this wouldn’t be an issue.

“Cloud agnostic”

At best this benefit is overstated. You use the same HCL language and even the same Terraform CI/CD tooling regardless of your cloud platform. Saying it’s “cloud agnostic” is akin to saying YAML, Python, or Java are cloud agnostic — they are indeed able to run on any platform. The reality is that every cloud platform provider has its own idiosyncrasies and nuances.

You still need to learn and understand all of these nuances across each platform’s provider. Often you will also require a deeper understanding of the cloud platform in order to debug apply issues or configuration mistakes. That’s because Terraform introduces a layer of abstraction that’s masquerading as agnostic, so you might not have an API that maps cleanly into the infrastructure’s more natural way of thinking.

Collaboration and permissions

This relates to the drift management and HCL sections, but it centers specifically on administrative tasks. Terraform often feels like it was designed to be managed by a TF czar who has the old school gatekeeper ops mentality. Of course, as is typical for ops teams, the TF czar has extremely elevated permissions. Enforcing an SDLC or permissions often requires additional tooling or more complicated repository structures. It certainly will require careful thought and planning up front.

As an example, consider running the import command or the state manipulation commands. That’s not something that should be run by a developer who has never done it. In fact, all commands should generally be run only through CI/CD tooling. However, it takes the average company quite a bit of time and investment to get its CI/CD tooling to the point where everything runs in an automated fashion. So what ends up happening is companies either make that large investment to turn these admin commands into self-service or they scale their ops team to be the Terraform czars.

Which ties into the collaboration story. In order to support collaboration, you need to structure your code into smaller infrastructure stacks. The more people who are working against the same Terraform configuration, the more collaboration challenges you’ll face due to conflicting changes and state conflicts. How often have you attempted to promote changes only to discover that someone else left the testing (or production!) environment in a semi-broken state that prevents applying the updated Terraform?

A huge downside to splitting the infrastructure configurations up is that it often results in various hard-coded references to components from other configurations propagating around. The net result is that, while in theory you could redeploy to a new environment, in practice it might be a massive undertaking requiring applying many configurations over and over and over until it is made consistent. That’s painful because the tooling used is often not designed to be used in that way. Most organizations rely on Terraform as a core component of their disaster recovery strategy. The theory is that we can spin up a new environment in a matter of minutes because the infrastructure is declaratively defined. The reality, however, is often large and complex configurations can take days to get fully applied and working, especially when few organizations actually run through their DR plan on a routine basis.
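One way to picture the problem with hard-coded cross-stack references is to write the dependencies down explicitly and ask what order the stacks would have to be applied in. The sketch below uses Python’s standard `graphlib` module; the stack names and their dependencies are hypothetical, and nothing here is something Terraform does for you across separate configurations.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical stacks, each mapped to the stacks whose outputs it references.
stack_dependencies = {
    "network": set(),
    "database": {"network"},
    "cluster": {"network"},
    "app": {"cluster", "database"},
}


def apply_order(dependencies):
    """Return one valid apply order; raises CycleError if stacks reference each other."""
    return list(TopologicalSorter(dependencies).static_order())


order = apply_order(stack_dependencies)
print(order)

# Two stacks that reference each other have no valid single-pass order.
try:
    apply_order({"app": {"monitoring"}, "monitoring": {"app"}})
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

When the references are only hard-coded strings, this graph exists solely in people’s heads. The cyclic case is where "apply over and over until it is consistent" comes from: no single pass through the stacks can satisfy every reference.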

Terraform works best when there are one or two folks managing a stack.

Licensing drama and community fragmentation

It’s incredibly difficult to build a sustainable business centered around open source products. This is particularly challenging for VC-funded startups which raise hundreds of millions in funding because the expectation is “unicorn or bust.” An IPO might mean a big exit for investors, but for the actual business, it’s a long hard road to finding profitability. HashiCorp did a lot to further IaC and move us toward a more declarative model of managing infrastructure. Their tooling achieved broad adoption in part by being open source and allowing other companies to build around it, extend it, and offer support. Few products ever achieve the type of success they have seen in terms of community support. But now they find themselves on that long hard road to profitability.

Last year, HashiCorp announced it was moving from a permissive open source license to a more restrictive Business Source License for its products. And while they have every right to do this and it is likely a necessity to deliver the financial results investors expect, it doesn’t change the fact that it reneges on what led to their success in the first place. It’s easy to sing Kumbaya about open source until there’s a big pile of cash at stake.

The truth is this licensing change currently does not affect the majority of Terraform users. It’s more aimed at companies creating competing products around Terraform. Nonetheless, it opens a can of worms that has put the Terraform community into an uproar. It’s unclear what the ramifications will be, but it is almost certainly the beginning stage of HashiCorp’s transformation into an enterprise software company. It has already led to a community fork of Terraform, OpenTofu, which seeks to be an open-source alternative. The fragmentation has begun.

And another thing…

Many of the issues cited above relate more closely to how Terraform is implemented and managed at the organization using it. However, most companies don’t initially know how to use it and accidentally proliferate bad patterns internally because they copy them from project to project. Only after Terraform has been in service for some time do the patterns that work for a given organization and its tooling emerge. That’s when some of the above struggles become extremely problematic, and by then the organization is stuck with its commitment. A good framework should make it hard (or at least painful) to use it incorrectly. It shouldn’t require a paywall to get rid of the need for hacks to resolve fundamental challenges, and it should have good patterns that can be seen from company to company. It should allow organizations to restructure or refactor their infrastructure later without causing major headaches.

A suggestion

A rant without an alternative is just a rant; by providing one, I can call this a “call to action.” An approach that is catching on is the Kubernetes operator pattern. Each of the major cloud providers offers operators which allow you to manage your cloud infrastructure the same way you manage your Kubernetes applications: GCP Config Connector, AWS Controllers for Kubernetes (ACK), and Azure Service Operator. Many infrastructure vendors have followed suit, such as Mongo Atlas, Confluent Kafka, and Elastic.

We have been using Config Connector and couldn’t be happier. In fact, we’re routinely impressed with the overall stability and the debuggability of our infrastructure. If you have existing patterns for templating and deploying Kubernetes YAML files, managing your infrastructure will be similar to managing your other service deployments. We leverage Config Connector extensively in our enterprise integration of GitLab and GCP called Konfig.

These operators use a control loop to periodically reconcile your resources which automatically corrects drift. Resources are specified and configured in YAML and they’re able to use resource references in a way that scales as your project grows. You have access to tooling designed to support more complex needs and use cases. You have the benefit of Kubernetes built-in RBAC controls for security restrictions and enabling sane self-service models. You have a way to enforce resource standards and provide developers with defaults across your organization (rather than relying on policy scanners like Checkov). Finally, when desired, it allows you to manage service-specific infrastructure alongside your application code in the same project, without fears around refactoring or moving the resource definition to a higher level if you need to share it with another service. It isn’t perfect, but after using this model it feels like an overall improvement and a big step in the right direction.
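The control loop idea can be sketched in a few lines of Python. This is a minimal illustration of the reconcile pattern, not how Config Connector or any other operator is implemented; the resource name, its fields, and the function names are all hypothetical, and a real controller talks to a cloud API and runs continuously rather than for a fixed number of passes.

```python
import copy


def reconcile(desired, actual):
    """Bring `actual` in line with `desired`; return the corrections made."""
    corrections = []
    for name, spec in desired.items():
        if name not in actual:
            actual[name] = copy.deepcopy(spec)
            corrections.append(("create", name))
        elif actual[name] != spec:
            actual[name] = copy.deepcopy(spec)
            corrections.append(("update", name))
    # Resources that exist only in `actual` are left alone in this sketch.
    return corrections


def control_loop(desired, actual, passes):
    """Run a fixed number of reconcile passes and record what each one did."""
    return [reconcile(desired, actual) for _ in range(passes)]


# Declared configuration versus a live resource someone edited by hand.
desired = {"bucket-logs": {"location": "US", "versioning": True}}
actual = {"bucket-logs": {"location": "US", "versioning": False}}

history = control_loop(desired, actual, passes=2)
print(history)
print(actual == desired)
```

The first pass notices the hand edit and reverts it; the second finds nothing to do. Drift gets corrected as a side effect of the loop running at all, with no state file surgery and no multi-step promotion through environments.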

Let’s wrap it up

In conclusion, while Terraform has played a significant role in popularizing infrastructure as code and simplifying deployment processes, its shortcomings are becoming increasingly apparent. From bespoke patterns to drift management issues, Terraform presents challenges that hinder efficient infrastructure management and collaboration within teams. Moreover, recent licensing changes and community fragmentation add uncertainty to its future.

However, this critique is not without a proposed solution. The Kubernetes operator pattern, exemplified by offerings like GCP Config Connector, AWS Controllers for Kubernetes (ACK), and Azure Service Operator, presents a compelling alternative. By leveraging Kubernetes’ control loop and YAML configuration, these operators streamline infrastructure management, automate drift correction, and improve collaboration. While this approach is not without its own complexities, adopting it represents a promising step forward in addressing the limitations of Terraform and advancing infrastructure management practices.

Ultimately, as technology evolves and new solutions emerge, it’s imperative for organizations to critically evaluate their tools and methodologies to ensure they remain aligned with their goals and objectives. Transitioning from Terraform to the Kubernetes operator pattern may offer a pathway towards more efficient, scalable, and resilient infrastructure management in the modern era. If you need support migrating away from Terraform, we can help.