Terraform is an incredibly powerful tool for managing infrastructure using code files. It offers flexibility and automation but can be overwhelming for both newcomers and experienced users. In this guide, I will share my most tried and tested tips to help you navigate the complexities of Terraform and maximize its capabilities.
This article is not your typical "best practices" guide that repeats what you've already read countless times. Instead, it aims to provide fresh insights and encourage you to think critically about your unique situation, so you can make the best decisions for your needs.
So, without further ado, let's dive right in!
1 Clean Code
Every tool or programming language has its limitations, and Terraform is no exception. One limitation that existed before Terraform 0.13 was the inability to use for_each with modules. It wasn't until August 2020 that HashiCorp introduced the capability to loop over modules with a single module call.
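As a rough illustration of what that capability looks like, here is a minimal sketch for Terraform 0.13 or later. The module path ./modules/bucket and its "name" input are hypothetical and only stand in for whatever module you already have.

```hcl
# Hypothetical local module that accepts a single "name" input.
module "bucket" {
  source = "./modules/bucket"

  # One module call, three instances: each.key is "logs", "assets" or "backups".
  for_each = toset(["logs", "assets", "backups"])

  name = "example-${each.key}"
}
```

Each instance is then addressed by its key, for example module.bucket["logs"].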
Once you understand and accept these quirks and features, you can leverage various best practices to organize your code and optimize its use. Although Terraform is not strictly a programming language, many of the same principles of writing code apply to Terraform as well.
Before we dive into Terraform code, let's take a quick look at coding or programming in general.
1.1 "Human" code
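I want to start this conversation with a quote from Abelson and Sussman's Structure and Interpretation of Computer Programs (it is often attributed to Knuth, who made a similar point about literate programming):
Programs must be written for people to read, and only incidentally for machines to execute.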
The computer has no problem with ambiguous variable names, overlong functions, or a single file of thousands of lines of code. It will still execute properly. All the methodologies and ideas like refactoring, clean code, naming conventions, modules, packages, code smells, etc., were invented so that we humans can read the code better, not so that computers can run it better.
1.2 Code, Evolving Code
Programs evolve. Code changes.
It's rare that you finish a piece of code and leave it there for the rest of your life. That's not how projects work. If that were the case, we wouldn't be talking about Terraform best practices: you would only use it once anyway.
At work, there is almost always a project going on, and periods without one are scarce. The business wants to improve, and a project is the way to move from the current state to the next desired state. By nature, "project" means change, and so the code is constantly changing too.
1.3 Writing Clean Code
In order to make the change more manageable, we write clean code.
We limit line width because humans are not good at reading very wide lines; we carefully choose the names of variables so that we immediately know what they mean the next time we read them; we try to keep functions short because shorter functions are not only easier to test but also easier to understand; we try to split a file with thousands of lines of code into smaller chunks.
Computers don't care about any of this at all. Be it one large file or ten smaller ones, the code will run. Clean code makes it easier and faster for humans to read, understand, and build upon.
By now, you may have already guessed our objective. We want to adopt practices that make our code easier to read, manage, and modify.
1.4 Terraform Comments Best Practices
One thing that can greatly enhance the maintainability of your Terraform code is comments. When coding, we often have implicit contexts in mind that would greatly benefit from being made explicit for our future selves or fellow coders. Here are some recommendations regarding comments:
- Explain complex or non-obvious logic: Terraform does support comments, so use that feature to provide a clear understanding of what the code is doing, especially when dealing with intricate or less intuitive logic, or when the context isn't clear.
- Document resource dependencies: if there are dependencies between resources, especially ones that are not obvious from the code itself, document them using comments. This will assist users in understanding the correct order for creating or destroying resources.
- Finally, always keep comments up to date! Outdated comments can be misleading and cause confusion.
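To make these recommendations concrete, here is a small sketch of a comment that explains a hidden dependency. The scenario (an EC2 instance whose software reads from S3 at boot), the resource names, and the ami_id variable are all assumptions made up for illustration. Terraform accepts #, // and /* */ comments, with # being the default style.

```hcl
variable "ami_id" {
  type        = string
  description = "AMI for the application instance (hypothetical input)."
}

resource "aws_iam_role" "app" {
  name = "example-app"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy" "app" {
  name = "example-app-config-read"
  role = aws_iam_role.app.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "s3:GetObject"
      Resource = "arn:aws:s3:::example-app-config/*"
    }]
  })
}

resource "aws_iam_instance_profile" "app" {
  name = "example-app"
  role = aws_iam_role.app.name
}

resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.micro"
  iam_instance_profile = aws_iam_instance_profile.app.name

  # The application downloads its configuration from S3 during boot, so the
  # role policy must exist before the instance starts. Terraform cannot infer
  # this from attribute references alone, hence the explicit depends_on.
  depends_on = [aws_iam_role_policy.app]
}
```

Without the comment, the depends_on line looks redundant and is an easy target for a well-meaning cleanup.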
2 Know Your Stuff
2.1 Play with It First
First, let's be clear: for people who are new to the cloud or new to a specific service in the cloud, I don't recommend using Terraform for the first attempt at creating that resource. This is not specific to Terraform; it holds no matter which IaC tool you are using.
Instead, go to the console, read the official documents and FAQs, figure out which parameters are mandatory and which are optional, and try things manually. Don't worry just because some big name in the DevOps world told you that "the moment you click a button in the console, you create technical debt." Forget about it. Get comfortable with the service first.
Once you grasp the keys of the resource you are going to create, you can automate it using Terraform.
Do you consider yourself an experienced Terraform user? This applies to you, too! Here is a story: AWS releases a new service or a new resource. You haven't used it yet. You tell yourself: "I'm a veteran; only noobs play with the console. I'll just get started with a perfect Terraform module right away. Why bother playing with it in the console first anyway?" Then, after hours of working and debugging, you find out that you were stuck only because of a small parameter or configuration of the resource that you didn't understand clearly in the first place. Sound familiar?
2.2 Know Your Infrastructure
You need to know what exactly is created and managed by your Terraform code.
This is especially important when using third-party modules because there are so many parameters and different use cases that it's hard to know exactly which scenario to pick, what resources will be created, and what values to set for all those parameters.
Oftentimes, when I need to provision some resources in the cloud using Terraform, I find that I can do it quicker if I write the resources and modules myself (of course, I could also re-use modules I wrote before) than if I look for a third-party module on the internet. A lot of third-party modules are heavily future-proofed: they try to solve everybody's problem with the same module. They are doing it the "monolithic" way.
I don't mean to blame anybody here, but if you search the Terraform registry for a module such as EKS, for example, you will find that it has a whopping 62 input parameters. If you want to create an EKS cluster in an existing VPC, using self-managed worker nodes, with certain launch templates, which parameters do you set? Have fun figuring that out.
Sometimes, reinventing the wheel is the better way to go: Terraform isn't easy to get started with, but once you are fluent in it, it's relatively easy to use. Creating a resource or a module isn't rocket science, and you can manage it within a reasonable period of time. Weigh the advantages and disadvantages of using third-party modules before you decide.
3 The Myth of Future-Proof
Instead of attempting to implement multiple features that you may need in the future, focus on implementing only the feature you currently require. Trying to future-proof or over-future-proof your code prematurely can result in messy code. This principle holds true regardless of whether you are using Terraform or any other Infrastructure as Code tool.
Many "best practices" would tell you to never use a local backend, always use a remote backend, run your Terraform from within a CI tool, always use modules, etc.
I'm telling you none of those.
Because there is no "one-size-fits-all" answer; it all depends.
Consider a simple scenario where you are working on a minimum viable product (MVP) or a proof of concept (POC). In such cases, it may not be necessary to invest time in creating a remote state with state lock, executing jobs from a cloud-based CI running in Kubernetes, or creating excessively small modules just because others claim it as a 'best practice.' Instead, focus on the specific requirements of your project and avoid unnecessary complexities that may hinder progress.
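As a sketch of what "good enough for a POC" could look like, the configuration below keeps state in a local file and leaves a remote backend as a commented-out option for later. The bucket name, key, and region are placeholders, not recommendations.

```hcl
terraform {
  # For a proof of concept, the local backend is often enough: state lives in
  # a file next to the code. Omitting the backend block has the same effect.
  backend "local" {
    path = "terraform.tfstate"
  }

  # Once several people or a pipeline apply the same code, the block above
  # can be swapped for a remote backend, for example:
  #
  # backend "s3" {
  #   bucket = "example-terraform-state" # placeholder bucket name
  #   key    = "poc/terraform.tfstate"
  #   region = "eu-west-1"
  # }
}
```

Switching later is a small change: after editing the backend block, running terraform init -migrate-state offers to copy the existing state to the new backend.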
Anticipating future needs is challenging, and even if you invest time in creating a meticulously designed and future-proof module with efficient remote state management, it is likely that when the time comes to utilize it, you may still need to refactor it. But don't get me wrong: creating a module isn't hard.
When attempting to future-proof your code, you often find yourself writing if-else statements, conditions, and branches to accommodate multiple scenarios. Refactoring is largely about reducing if-else and simplifying the code. So why introduce complexity when you don't really need it now?
I'm not telling you to give up modules, remote states, and other fancy features. My point is, do create a flexible base that can adapt to possible future changes, but don't waste too much time and energy future-proofing it.
4 Do One Thing and Do It Right
Just like the example given in the previous section, there are too many examples of Terraform modules and code that try to be the "complete package" by supporting every single possible scenario.
Experienced Terraform users might already be familiar with this: to make your module "complete" and useful in every scenario, you use complicated input variable structures, you create even more complicated local variables with short-hand conditions, and you even use built-in functions to merge multiple variables into one so that if one variable is empty, you can still get the value from another and no error is raised.
For starters, code like this is not "declarative" anymore. When you read something that complicated, you can't really tell what infrastructure the code is going to create, and you can't tell what value ends up being set for a specific parameter.
Maybe writing a module for a specific scenario isn't that bad. When you have a slightly different use case, create another module. This might generate duplicated code, bringing me to my next point.
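The contrast can be sketched with a deliberately small example. Both halves below configure S3 bucket versioning; the variable names, the merge logic, and the bucket names are invented for illustration and are not taken from any real module.

```hcl
# --- "Complete package" style: what does versioning end up being? ---

variable "environment" {
  type    = string
  default = "dev"
}

variable "defaults" {
  type    = map(string)
  default = {}
}

variable "overrides" {
  type    = map(string)
  default = {}
}

locals {
  # Later maps win, so overrides replace defaults key by key.
  settings = merge(var.defaults, var.overrides)

  # Falls back to an environment-dependent value when the key is missing.
  versioning_status = lookup(
    local.settings,
    "versioning",
    var.environment == "prod" ? "Enabled" : "Suspended"
  )
}

resource "aws_s3_bucket" "generic" {
  bucket = "example-generic-bucket"
}

resource "aws_s3_bucket_versioning" "generic" {
  bucket = aws_s3_bucket.generic.id

  versioning_configuration {
    status = local.versioning_status
  }
}

# --- Single-purpose style: the value is right there ---

resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket"
}

resource "aws_s3_bucket_versioning" "logs" {
  bucket = aws_s3_bucket.logs.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

To answer "is versioning on?" for the first bucket, a reader has to trace three variables and two expressions. For the second, reading one line is enough.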
5 The Art of Finding the Balance: DRY vs Readability
DRY means Don't Repeat Yourself, and this principle is loved by many programmers.
Yet, you must also find the right balance between "duplicated code" and "readability." This is also true for any programming language because code is for humans to read.
When you want to achieve two things in one piece of code, you will need extra input parameters. You will need if-else. You will need to generate various outputs too. Adding too many features into one piece of code will invariably reduce the readability because brains are not so great with if-else and parameters.
On the other hand, you can choose to have two pieces of code for two slightly different features, with both having straightforward logic flow and being easy to read, but in this way, you will probably have some duplicated code. Using the right technique, for example, extracting a similar part out and creating a small module for it (if it will be commonly used) might be an answer.
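One possible shape of that compromise is sketched below: the two buckets are written out separately and stay easy to read, while the one part that really is identical (the tagging convention) is extracted into a tiny module. The module name common_tags and all resource names are hypothetical.

```hcl
# File: modules/common_tags/main.tf (hypothetical small shared module)
variable "service" {
  type        = string
  description = "Name of the service that owns the resources."
}

output "tags" {
  value = {
    Service   = var.service
    ManagedBy = "terraform"
  }
}

# File: main.tf
module "api_tags" {
  source  = "./modules/common_tags"
  service = "api"
}

module "worker_tags" {
  source  = "./modules/common_tags"
  service = "worker"
}

# Two similar buckets, deliberately written out twice instead of being
# squeezed into one parameterized module.
resource "aws_s3_bucket" "api_uploads" {
  bucket = "example-api-uploads"
  tags   = module.api_tags.tags
}

resource "aws_s3_bucket" "worker_scratch" {
  bucket        = "example-worker-scratch"
  force_destroy = true # scratch data may be deleted together with the bucket
  tags          = module.worker_tags.tags
}
```

The resource blocks are partly duplicated, but each one can be read and changed without thinking about the other.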
Finding the right balance between duplicated code and readability is an art that takes experience to master, and only you can decide where that balance lies for your project. "You must have less than 10% duplicated code" or "reuse modules as much as possible" are simply not pragmatic or helpful suggestions.
6 Separate Infrastructure from Configuration
6.1 A Story
I was in a project where we used IaC to create Kubernetes clusters. On top of the Terraform code, we also used the Terraform Kubernetes provider to install some required components inside the cluster.
So far, so good, because Terraform is idempotent (more on that later) by design.
The thing is, if a certain resource already existed in the cluster (like a ConfigMap), the Kubernetes provider couldn't "upsert" (update OR create) it, and the plan would fail.
The provider would break the idempotency.
This is an example of why you want to separate your IaC part from the configuration management part, because not only does it make sense logically, but also it reduces complexity.
In the example above, the solution was to use Terraform to only create the cluster (and nothing else), and then delegate the resource "upsert" to the CI/CD tool with a kubectl apply execution.
So it's important to understand the crucial distinction between Infrastructure as Code (IaC) and Configuration Management (CM).
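The split from the story could look roughly like the sketch below. The story does not say which cloud or cluster type was involved, so Amazon EKS is an assumption here, as are the variable names and the manifests/ directory.

```hcl
variable "cluster_role_arn" {
  type        = string
  description = "IAM role for the EKS control plane (assumed to exist already)."
}

variable "subnet_ids" {
  type        = list(string)
  description = "Subnets of an existing VPC (assumed to exist already)."
}

# Terraform owns the cluster and nothing inside it.
resource "aws_eks_cluster" "main" {
  name     = "example"
  role_arn = var.cluster_role_arn

  vpc_config {
    subnet_ids = var.subnet_ids
  }
}

# Outputs are the hand-over point to the CI/CD pipeline.
output "cluster_name" {
  value = aws_eks_cluster.main.name
}

output "cluster_endpoint" {
  value = aws_eks_cluster.main.endpoint
}

# In the pipeline, after "terraform apply" (outside of Terraform):
#   aws eks update-kubeconfig --name "$(terraform output -raw cluster_name)"
#   kubectl apply -f manifests/
```

Because kubectl apply creates missing objects and updates existing ones, the pipeline step can be re-run safely, which is exactly the "upsert" behavior that was missing.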
6.2 What's Infrastructure as Code (IaC)?
Infrastructure as Code (IaC) manages infrastructure in a descriptive model:
- It uses code files as the definition rather than interactive tools.
- It tries to achieve 100% automation.
- It doesn't matter if you run your own data center or you use the public cloud.
You write code to manage your networks, servers (physical servers or virtual machines), connections, connection topology, load balancers, etc.
6.3 What's Configuration Management (CM)?
Configuration Management (CM), on the other hand, maintains computer systems, software, dependencies, settings, etc., in a desired, consistent state.
Let's consider a physical data center scenario: purchasing servers, putting a newly purchased server onto a rack, connecting networking cables to the switches so that it's connected to the existing networks (or launching a new virtual machine and assigning network interfaces to it) are tasks that fall under the umbrella of "infrastructure" and are typically handled by specialized teams.
However, once the server is launched, the focus shifts to configuring it to run specific software. This includes tasks such as installing an HTTP server software and configuring it to meet specific requirements. This aspect of managing the server's software and settings is known as "configuration management." Interestingly, this responsibility can be assigned to a separate team that doesn't need to be concerned with the underlying infrastructure, as their primary focus is on software configuration rather than infrastructure management.
6.4 The IaC and CM Separation
In the real world, things are not as simple as the "black or white" example above because we have many different tools and technologies allowing us to do Infrastructure as Code, configuration management, or both at the same time.
For example, although Terraform is considered an IaC tool, it can perform configurations and installations on certain servers. Although Ansible is considered a configuration management tool, it can launch virtual machines.
Finding the right boundary for you, figuring out which part you would like Terraform to manage, and how Terraform interacts with your choice of CM tools is crucial, especially for large projects.
In an ever-changing world, the entropy in your system only increases. In the long run, you will benefit greatly from two principles: "simplify" and "do one thing and do it right."
7 Make Your Terraform Code Idempotent
Idempotent means that no matter how many times you run your IaC, and no matter what your starting state is, you will end up with the same end state.
The same principle applies to configuration management too.
Why do we need idempotency?
Idempotency is nice to have because infrastructure and configuration are not getting simpler as time goes on. Even if you have just started fresh, you will face complicated situations soon enough. Idempotency simplifies the provisioning of infrastructure and the management of configurations, reducing the chances of inconsistent results.
For example, you need to set up A, and then set up B after A is finished. If setting up B fails, you want to re-run your automation so that it can retry setting up B without trying to create A again (if it tries to create A again, it will fail because A already exists).
How to make it idempotent? Read on.
8 Make Your Terraform Code Declarative
To achieve idempotency, a declarative style of code is preferred in most cases.
Declarative means defining the final state you want to have, rather than what command to execute in the code.
For example, you want to install an HTTP webserver. The task should be described as "ensure an HTTP server is installed" (i.e., if the HTTP server isn't installed, install it; if already installed, do nothing), instead of "run this apt command to install the server."
When you look at your Infrastructure as Code, it should be like reading a document, a description of what you will have if you run this code, no matter how many times you run it.
When writing your infrastructure code or even creating a Terraform provider, you need to have "side effects" in mind. If this part runs a shell command or script, what happens if I run "terraform apply" again?
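To illustrate the question, here is a sketch that creates a bucket in two ways. The terraform_data resource requires Terraform 1.4 or later, and the bucket names are placeholders.

```hcl
# Imperative: Terraform only records that a command was run once.
# It has no idea what the command created, so it cannot detect drift,
# update the bucket, or remove it on "terraform destroy".
resource "terraform_data" "create_bucket" {
  provisioner "local-exec" {
    command = "aws s3 mb s3://example-imperative-demo"
  }
}

# Declarative: this block describes the desired end state. Every
# "terraform apply" compares it with reality and only acts on the difference,
# so a second run with no changes does nothing.
resource "aws_s3_bucket" "demo" {
  bucket = "example-declarative-demo"
}
```

Reading the second block tells you what will exist after the run. Reading the first only tells you which command was executed at some point.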
9 Forget about Cloud Agnostic / Vendor Lock-in
This might be controversial, but I'd like to make it clear: Terraform isn't "cloud-agnostic", and vendor lock-in doesn't matter (at least not as much as you might think).
In many cases, people fight hard to avoid vendor lock-in because they want a "backup plan" in case things don't work out nicely with the current vendor. They want the option of moving to another vendor with as little trouble as possible.
That's rarely how it plays out in the real world.
When you buy servers in bulk, you probably sign multi-year contracts with the vendor for a better price. When you are using the cloud, you rarely decide to move to another cloud. Having a multi-cloud setup, maybe yes, but migrating from one cloud to another isn't common, although situations like that do exist.
Even if you choose Terraform to manage your AWS resources because you might want to move to GCP or Azure in the future, and you know Terraform works with GCP and Azure, in reality, you can't re-use your code. If you want to switch from one cloud to another, you need to rewrite all your Terraform code: different clouds have different Terraform providers, and their resource names and parameters differ greatly.
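A quick side-by-side shows how little carries over. Both resources below describe "one small virtual machine"; the AMI ID, image name, and zone are placeholders chosen for illustration.

```hcl
# AWS: the machine is an aws_instance, sized by instance_type,
# booted from an AMI.
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t3.micro"
}

# GCP: the machine is a google_compute_instance, sized by machine_type,
# with the boot image and network declared in nested blocks.
resource "google_compute_instance" "web" {
  name         = "web"
  machine_type = "e2-micro"
  zone         = "europe-west1-b"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }
}
```

The language and the workflow are shared, but the resource types, argument names, and required blocks are not, so moving clouds means rewriting rather than re-pointing.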
Admit it or not, you are vendor-locked in, one way or another.
Once you are clear about this, it is, in fact, easier to choose the right tool for the job: you are no longer afraid of vendor lock-in, and you don't treat it as the top priority when making comparisons. Instead, you start to see the features, advantages, and disadvantages of each choice. For example, if you already use Terraform with AWS, but for this specific piece of infrastructure it would be easier to use AWS CDK or some other tool (for example, eksctl to create a K8s cluster), why not?
Summary
While Terraform may initially seem daunting with its learning curve, once you master it, you gain the flexibility to effectively manage your infrastructure.
It is important to acknowledge that there can be pitfalls and issues if Terraform is not used properly. However, with careful attention and ongoing refactoring, it becomes manageable, and your code can remain clean and easily readable. These 'best practices' can greatly assist in achieving these goals. Let's sum them up:
- Play with the cloud console and understand the resource before automating it with Terraform.
- Understand what your Terraform code creates and manages, especially when using third-party modules.
- Be cautious and take security measures to prevent mistakes in your Terraform code, like scanning for misconfigurations.
- Avoid creating overly complex and monolithic Terraform modules.
- Separate infrastructure management (IaC) from configuration management (CM) to reduce complexity.
- Use a declarative code style to achieve idempotency and clarity in your Terraform code.
- Don't worry too much about vendor lock-in, as it is often unavoidable in practice.
- Choose the right tool for the job based on features and advantages, rather than fear of vendor lock-in.
Are you interested in learning about Infrastructure as Code (IaC) security? Check out this cheat sheet and learn more in its accompanying blog post.