Adopting IaC templates improved our operations: when we want to deploy a new environment, we simply copy a similar template and make a few small changes to adapt it to the desired infrastructure. We soon found an issue with this pattern, though: we were creating a new application load balancer (ALB) for every service we deployed.

There were not that many services, around 9, but we also ran lower environments for testing, in our case Development and QA alongside Production. They are exact replicas of Production, just with smaller capacity. Each ALB has a base cost of around $19 per month just for running, plus a variable, usage-based cost measured in LCUs (Load Balancer Capacity Units). The following table shows the total base cost:

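| Environments | ALB base cost (monthly) | Services deployed | Monthly cost | Yearly cost |
|--------------|-------------------------|-------------------|--------------|-------------|
| 3            | $19                     | 9                 | $513         | $6,156      |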

After discussing this with the team, I was put in charge of the migration. The fix itself was quick: we created one main load balancer per environment using IaC, and each service template now only associates the service with that load balancer. Everything went well in general, but we hit one unexpected problem.

As part of the plan, we knew there would be a short downtime while the changes were applied, and the business had been notified, so we were covered. The changes were also scheduled for the lowest-activity period to minimize the impact on customers. Since we had 2 lower environments, the logical approach was to apply the changes there first so I could make sure everything worked before touching the Production infrastructure.

The DEV services were fully migrated in a few hours without many issues, just the usual: a mistyped reference, and once that was fixed all the other templates migrated seamlessly. The next day it was time to migrate the QA environment, and everything looked good until I started migrating the Tax API service. The service was down for around 10 minutes before I noticed, thanks to the alerts set on all the API calls made from the Production app. I immediately knew something was wrong with the changes and performed a rollback. That took about 15 more minutes, after which the errors disappeared and our Production app was working again.

After the rollback I informed the CTO. My suspicion was that the Production app was somehow using the QA environment as its reference, and it turned out to be something similar. In the meantime, the CTO and I checked the database to determine the total time offline and the number of transactions lost. To our surprise we didn't lose any, but we had stopped collecting taxes on a few transactions, something the finance team could fix manually in the books.

Knowing that operations were not affected, I started investigating. After a few minutes of reading the code, the reason was evident: the API reference in the code had a default value to use when the environment variable is missing. Yes, after the Tax API was implemented no one added the environment variable, so by default all 3 environments were pointing to QA.
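One way to avoid this kind of silent fallback is to fail fast when a required setting is missing. The following is a minimal Python sketch of that idea, not the actual code from the app; the variable name `TAX_API_URL`, the helper `require_env`, and the exception `MissingConfigError` are all hypothetical names chosen for illustration.

```python
import os


class MissingConfigError(RuntimeError):
    """Raised when a required environment variable is not set."""


def require_env(name, environ=os.environ):
    """Return the value of a required variable, or raise if it is missing.

    No default is accepted on purpose: a hard-coded fallback is what lets
    every environment silently point at the same endpoint.
    `environ` can be any mapping, which makes the helper easy to test.
    """
    value = environ.get(name, "").strip()
    if not value:
        raise MissingConfigError(
            f"required environment variable {name} is not set"
        )
    return value
```

Calling `require_env("TAX_API_URL")` at startup would make a service with a missing variable refuse to boot instead of quietly talking to another environment, so the gap shows up at deploy time rather than during an unrelated change.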

After fixing the issue, I started writing the Post Mortem document, as our protocol requires in these cases. Our Post Mortem procedure is to build the document from all the information we found about the issue: the recorded time of every event, the actions taken, and the recommendations for the case. Most important of all, it is No Blame: the document doesn't mention any names, only events. The idea is to figure out what went wrong and share it with the entire team so we avoid the same events in the future.

With the main issue solved, I continued with the migration. QA was fully migrated, and the next day it was Production's turn. I made the changes during the window we considered safe, when only a few transactions would be affected. To confirm this, we tracked all of the transactions using Grafana.

[Figure: Individual Transactions report]

[Figure: Transactions Heatmap]

After the migration there were only 3 ALBs running, which means around $57 monthly or $684 yearly in base cost, for savings of $5,472 per year.
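The base-cost arithmetic can be checked with a few lines of Python. This is only a sketch: it uses the approximate $19 monthly base cost per ALB quoted earlier and ignores the variable LCU cost, and the function and constant names are made up for this example.

```python
ALB_BASE_COST_USD = 19  # approximate monthly base cost per ALB, LCU cost excluded
ENVIRONMENTS = 3        # Development, QA and Production
SERVICES = 9


def yearly_base_cost(alb_count):
    """Yearly base cost in USD for the given number of running ALBs."""
    return alb_count * ALB_BASE_COST_USD * 12


# Before: one ALB per service in every environment (27 ALBs).
before = yearly_base_cost(ENVIRONMENTS * SERVICES)
# After: one shared ALB per environment (3 ALBs).
after = yearly_base_cost(ENVIRONMENTS)
savings = before - after

print(f"before: ${before:,}  after: ${after:,}  savings: ${savings:,}")
# prints: before: $6,156  after: $684  savings: $5,472
```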