Scalable Terraform Architecture

Nestoras
Beat Engineering Blog
5 min read · Sep 15, 2020


How to destroy and recreate your whole infrastructure fearlessly

The old days

At Beat, we run all our workloads on AWS and use infrastructure as code to create and manage our AWS resources. We started with AWS CloudFormation. Two years ago we switched to Terraform 0.11 and created a single repository to hold all of our Terraform code. We used Terraform modules to build reusable abstractions, and we stored those modules in the same repo as the logic for our stacks. We have multiple stacks because we run one per market; we call this the island model, since we end up with multiple identical islands in production. Each stack has its own Terraform remote state. We also set up Atlantis so that we could practice GitOps.

In a hyper-growth trajectory

As the company kept growing and the number of engineers increased, we reorganized engineering into business units, which we call domains. Each domain has a specific business mission. We also had a monolithic application and decided to migrate to a microservices architecture. More teams and more microservices mean more infrastructure to provision. At that time, a single DevOps team was responsible for provisioning and maintaining all infrastructure, and it soon became a bottleneck.

Destroying Silos

The ride-hailing industry is very demanding and highly competitive. To stay ahead of the competition, you need to move fast and deliver results quickly. Our domains need the autonomy to build their infrastructure in a few minutes. This is why we decided to create an AWS account for each domain, for security, billing, and isolation reasons. With this approach:

  • Our teams have their own AWS console
  • They can manage their resources without conflicts
  • Each domain has its own AWS account
  • It’s easier for us to manage our budgets

We are looking for a Senior DevOps Engineer. Join us!

Addressing the challenges

In our previous setup, deploying a change meant waiting half an hour for Atlantis to plan it. Our modules were also not very reusable or DRY, and we had no release management for them. Last but not least, most product teams were unfamiliar with our IaC toolchain and faced a learning curve with Terraform.

To resolve these problems, we implemented the following:

IaC Library

First, we defined which modules our infrastructure needs and wrote a policy for what counts as a module:

A module must be reusable and create all the resources needed automatically with only one apply command.

After this investigation we realized that we needed to build up to 70 different modules, so we grouped them into categories and created a repository for each category (auth, databases, monitoring, etc.). Each repository must have the same folder structure, documentation for each module, Terragrunt examples for each use case, and a test folder (the Testing section below explains how we test our IaC).

Splitting our modules into one repository per category helps with release management, and each repository keeps a changelog that records the what, when, and why of every change. We try to implement all changes in a backward-compatible manner.
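To make "backward-compatible" concrete, here is a minimal Go sketch of a check a team could run before moving a module pin to a newer release. It assumes the modules are tagged with semantic versions of the form vMAJOR.MINOR.PATCH, which this post does not state; the names ParseVersion and SafeUpgrade are hypothetical and not part of our library.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Version is a parsed release tag, assuming tags of the form vMAJOR.MINOR.PATCH.
type Version struct {
	Major, Minor, Patch int
}

// ParseVersion parses a tag such as "v1.4.2" or "1.4.2".
func ParseVersion(tag string) (Version, error) {
	parts := strings.Split(strings.TrimPrefix(tag, "v"), ".")
	if len(parts) != 3 {
		return Version{}, fmt.Errorf("invalid version tag %q", tag)
	}
	var nums [3]int
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return Version{}, fmt.Errorf("invalid version tag %q", tag)
		}
		nums[i] = n
	}
	return Version{Major: nums[0], Minor: nums[1], Patch: nums[2]}, nil
}

// less reports whether a is an older release than b.
func less(a, b Version) bool {
	if a.Major != b.Major {
		return a.Major < b.Major
	}
	if a.Minor != b.Minor {
		return a.Minor < b.Minor
	}
	return a.Patch < b.Patch
}

// SafeUpgrade reports whether moving a pin from one tag to another stays
// within the same major version and does not move backwards.
func SafeUpgrade(from, to string) (bool, error) {
	f, err := ParseVersion(from)
	if err != nil {
		return false, err
	}
	t, err := ParseVersion(to)
	if err != nil {
		return false, err
	}
	return f.Major == t.Major && !less(t, f), nil
}

func main() {
	// Hypothetical tags, for illustration only.
	for _, to := range []string{"v1.3.0", "v2.0.0"} {
		ok, err := SafeUpgrade("v1.2.0", to)
		fmt.Println("v1.2.0 ->", to, "safe:", ok, "err:", err)
	}
}
```

Under this assumption, a major-version bump is the signal that consumers should read the changelog before upgrading, while minor and patch bumps are expected to be drop-in.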

Finally, our engineers can choose which module they want to provision for their microservices. For example, if they want to create a new database, our library provides everything needed for autoscaling, security, monitoring, and backups in a single configuration file of 20 lines of code.

Environments

Let’s look at what an environment at Beat looks like. In our environments, we use Terragrunt (a tool developed by Gruntwork) for the following reasons:

  1. You don’t need to know Terraform to create your resources
  2. You can keep your code DRY
  3. Each resource is isolated, so our changes are applied faster
  4. Multiple remote states and well-structured folders
  5. It is fast, because we can apply multiple resources in parallel

Our developers know they can search our IaC library to find what they need, then open a PR describing it. Once the PR is approved, Atlantis applies the changes to our production systems. This gives each team the autonomy to ship its code with no dependency on other teams apart from code review.

In the next section, we’ll discuss how we test our code in our environments.

Testing

Infrastructure as code is still code, so we need to write tests to ensure that everything works as expected. We use Terratest (also developed by Gruntwork) and have two categories of tests: unit tests and integration tests.

Unit tests: Each module repo has an example folder covering the most common use cases of a module. We run these examples as tests to make sure our changes don’t break the module. This has proven very useful, because the examples double as test cases.
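The lifecycle behind such a test is short: initialize the example, apply it, and always destroy it afterwards, even when the apply fails. Terratest expresses this with InitAndApply and a deferred Destroy. The sketch below is not Terratest and not our actual test code; it is a standard-library stand-in that shells out to the terraform CLI, and it assumes each example folder can be applied directly with terraform. The names Runner, ExecRunner, and CheckExample are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// Runner executes one terraform subcommand inside dir.
// It is injectable so the lifecycle can be exercised without real infrastructure.
type Runner func(dir string, args ...string) error

// ExecRunner shells out to the terraform binary found on PATH.
func ExecRunner(dir string, args ...string) error {
	cmd := exec.Command("terraform", args...)
	cmd.Dir = dir
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// CheckExample initializes and applies one example folder, then destroys it.
func CheckExample(dir string, run Runner) (err error) {
	if err = run(dir, "init", "-input=false"); err != nil {
		return fmt.Errorf("init %s: %w", dir, err)
	}
	// Destroy runs even when apply fails, so partially created resources
	// are cleaned up. An apply error takes precedence over a destroy error.
	defer func() {
		if derr := run(dir, "destroy", "-auto-approve"); derr != nil && err == nil {
			err = fmt.Errorf("destroy %s: %w", dir, derr)
		}
	}()
	if err = run(dir, "apply", "-auto-approve", "-input=false"); err != nil {
		return fmt.Errorf("apply %s: %w", dir, err)
	}
	return nil
}

func main() {
	// Hypothetical usage: pass example folders as command-line arguments.
	for _, dir := range os.Args[1:] {
		if err := CheckExample(dir, ExecRunner); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

Passing the runner as a parameter keeps the sequencing logic separate from the CLI call, so the order of steps can be checked with a fake runner before anything touches an AWS account.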

Integration tests: Our integration tests live in the environment repositories. When we build a new service that depends on several modules, an integration test verifies that all of those dependencies are created successfully.
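An integration test of this kind relies on modules being created in dependency order. Terraform and Terragrunt resolve that ordering themselves; the Go sketch below only illustrates the idea with a plain topological sort (Kahn's algorithm) and is not taken from our test suite. The function name CreationOrder and the module names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// CreationOrder returns module names ordered so that every module appears
// after the modules it depends on. deps maps a module to its dependencies.
func CreationOrder(deps map[string][]string) ([]string, error) {
	indegree := map[string]int{}        // number of unmet dependencies per module
	dependents := map[string][]string{} // reverse edges: dependency -> modules needing it
	for mod, ds := range deps {
		if _, ok := indegree[mod]; !ok {
			indegree[mod] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[mod]++
			dependents[d] = append(dependents[d], mod)
		}
	}

	// Start with modules that depend on nothing; sort for a stable result.
	var ready []string
	for mod, n := range indegree {
		if n == 0 {
			ready = append(ready, mod)
		}
	}
	sort.Strings(ready)

	var order []string
	for len(ready) > 0 {
		mod := ready[0]
		ready = ready[1:]
		order = append(order, mod)
		next := dependents[mod]
		sort.Strings(next)
		for _, m := range next {
			indegree[m]--
			if indegree[m] == 0 {
				ready = append(ready, m)
			}
		}
	}
	// Any module left unprocessed is part of a cycle.
	if len(order) != len(indegree) {
		return nil, errors.New("dependency cycle detected")
	}
	return order, nil
}

func main() {
	// Hypothetical service: it needs a database and an IAM role,
	// and the database needs a VPC.
	deps := map[string][]string{
		"service":  {"database", "iam-role"},
		"database": {"vpc"},
	}
	order, err := CreationOrder(deps)
	fmt.Println(order, err)
}
```

A test built on this idea would create the modules in the returned order and fail as soon as one of them cannot be created, which points directly at the broken dependency.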

All tests run in a playground AWS account.

Where next?

Having adopted unit and integration tests for our infrastructure code, we want to add a new category: end-to-end tests. This will let us recreate our infrastructure at the click of a button and treat it like cattle, not like a pet.

We also plan to enrich our Infrastructure as Code Library with new modules based on our engineering teams' needs. We plan to promote immutable, versioned infrastructure across environments, share the knowledge with developers, and start building on top of battle-tested infrastructure code.

Interested in our projects? We are looking for a Senior DevOps Engineer to join us on the ride.

Read more about our DevOps and Infrastructure projects.

Nestoras Stefanou is a Lead DevOps Engineer at Beat. He is interested in building immutable infrastructure and battle-tested infrastructure code. A Go lover who wants to build performant, reliable, and scalable microservices.
