Starting to use IaC templates was an improvement in our operation, as when we wanted to deploy a new environment, we simply copied a similar template and made some small changes to adapt it to the new desired infrastructure. Soon, we found out an issue with this pattern: we were creating application load balancers for every new service deployed.

There were not so many, something around 9 services, maybe, but the issue is we also implemented lower environments for testing purposes. In our case, Development and QA, besides Production, and they are an exact replica, obviously with smaller capacity. But when we talk about the ALBs, they have a base cost of around $19 for having them running and a variable cost depending on usage, called LCU cost. In the following table, we can see the total costs associated:

Environments ALB base cost Services deployed $ Monthly $ Yearly
3 19$ 9 $513 $6,156

After discussing this with the team, I was in charge of performing the migration. It was fixed quickly, as we simply created a main load balancer using IaC, and from every service template, we only associated it with the load balancer. Everything was cool in general, but we had an unexpected inconvenient.

As part of the plan, we were aware of a short downtime during the execution of the new changes, and the business was notified about this, so we were covered. Also, these changes were planned to be executed during the lowest activity period, minimizing the impact for customers. As we had 2 lower environments, the logic here is performing the changes there first, so I can make sure everything is working before deploying any changes to the Production infrastructure.

The DEV services were fully migrated in a few hours without many issues - the usual mistyping of a reference, and after fixing that, all the other templates were migrated seamlessly. The next day was the time to migrate the QA environment, and everything was looking good until I started migrating the Tax API service. For around 10 minutes, the service was down until I realized it, thanks to the alerts set for all the API calls made from the Production App. Immediately, I knew something was wrong with the changes and performed a rollback. This took like 15 more minutes, and at the end, the errors disappeared, and our Production app was working again.

After fixing the issue, I informed the CTO about it. My suspicion was that probably the QA environment was set as a reference for the Production one and it was something similar. In the meantime, the CTO and I were checking the database to determine the total time offline and the total transactions lost. To our surprise, we didn’t lose any, but we stopped collecting taxes from a few transactions, something that the finance team could manually fix in the books.

Knowing that the operations were not affected, I started investigating the issue. After a few minutes reading the code, the reason was evident: the API reference in the code had a default value to use in case the environment variable was missing. Yes, after the Tax API implementation, no one added the environment variable, and by default, the 3 environments were pointing to QA.

After fixing the issue, I started creating the Post Mortem document as part of our protocol in these cases. Our Post Mortem procedure is creating the document based on all the information we found around the issue, the registered time for every event, including the actions and recommendations for the case, and the most important, No Blame. The document doesn’t provide any name, only events; the idea is figuring out what is wrong and sharing it with the entire team so we avoid the same events in the future.

Now that the main issue was solved, I continued with the migration. QA was fully migrated, and the next day, it was the time for Production. I performed the changes during the time we considered was safe and that only a few transactions were affected. To make sure of this, we tracked all of the transactions using Grafana.

Individual Transactions report: Individual Transactions report

Transactions Heatmap: Transactions Heatmap

After the migration, there were only 3 ALBs running, which means around $57 monthly or $684 yearly, meaning $5,473 in savings.