Empowering non-technical users by allowing feature branch deployments from Slack.

After several iterations on our pipelines, we reached a point where everything runs smoothly, and most of the new improvements are optimizations, like shaving a minute off a step by reducing Docker image sizes, or other tweaks that bring real gains to our flow.

As we are a Super International Team, some team members start work earlier than others. This is not an issue for us, as we focus more on the value we add to the business than on counting the hours we spend sitting in front of a computer. The other day during our daily meeting, I overheard a conversation about who was responsible for deploying the feature environments. These environments are torn down every day to avoid unexpected costs, and the team was having a hard time deciding who had to deploy them every morning. To make a long story short, they came to an agreement and made it work.
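The full post covers the setup, but the heart of the idea is a slash command handler that triggers the deployment pipeline. Here is a minimal sketch using Bolt for Python; the /deploy-feature command, the CI webhook URL, and the environment variable names are all hypothetical stand-ins for whatever your CI system exposes:

```python
# Minimal sketch: a Slack slash command that triggers a feature deployment.
# The command name, CI webhook URL, and env var names are hypothetical.
import os

import requests
from slack_bolt import App

app = App(
    token=os.environ["SLACK_BOT_TOKEN"],
    signing_secret=os.environ["SLACK_SIGNING_SECRET"],
)

@app.command("/deploy-feature")
def deploy_feature(ack, command, respond):
    # Slack requires an acknowledgment within 3 seconds, so ack first
    # and let the pipeline run asynchronously.
    ack()
    branch = command["text"].strip() or "main"
    # Kick off the pipeline; here we assume a generic CI webhook endpoint.
    resp = requests.post(
        os.environ["CI_TRIGGER_URL"],  # hypothetical CI webhook
        json={"branch": branch, "environment": "feature"},
        headers={"Authorization": f"Bearer {os.environ['CI_API_TOKEN']}"},
        timeout=10,
    )
    if resp.ok:
        respond(f"Deploying `{branch}` to a feature environment :rocket:")
    else:
        respond(f"Deploy trigger failed with status {resp.status_code}")

if __name__ == "__main__":
    app.start(port=3000)
```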

Read more →

Empowering our QA team to prevent our flow to production from getting stuck

When we implemented the trunk-based deployment flow, our pipeline to production sped up a lot. This was really cool, but at the same time we were producing more code than the QA team could handle, which blocked us from deploying specific features to production whenever something was already in the queue. We partially solved this by creating feature branch deployments, but devs kept merging to QA instead of deploying the testing environments, bringing the issue right back.

Read more →

Saving $8,220 and increasing availability by migrating the main app to containers

One of the first infrastructure changes I made was migrating the main app from three EC2 servers to ECS Fargate. This was the outcome of a conversation with the CTO, in which I proposed helping with some DevOps tasks.

Several reasons pushed me to commit to this. First, when I started working for the company, I got access to AWS and discovered that our servers were oversized and we were paying too much for them. The main server, which was in charge of running a bunch of cron jobs, sat under 5% CPU usage, and the other two servers were under 2%. RAM usage told a similar story.
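Claims like these are easy to verify from metrics. Here is a rough sketch of pulling two weeks of CPU utilization for one instance with boto3 and CloudWatch; the instance ID is a placeholder:

```python
# A rough sketch of how to verify the oversizing from metrics: pull two
# weeks of hourly CPUUtilization for one instance via CloudWatch.
# The instance ID below is a placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(days=14),
    EndTime=now,
    Period=3600,  # one datapoint per hour
    Statistics=["Average", "Maximum"],
)

datapoints = stats["Datapoints"]
if not datapoints:
    raise SystemExit("No datapoints returned; check the instance ID")

avg = sum(d["Average"] for d in datapoints) / len(datapoints)
peak = max(d["Maximum"] for d in datapoints)
print(f"14-day average CPU: {avg:.1f}%, peak: {peak:.1f}%")
```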

Read more →

Saving $5,473 by Removing Redundant Application Load Balancers

Starting to use IaC templates was an improvement in our operation: when we wanted to deploy a new environment, we simply copied a similar template and made some small changes to adapt it to the new desired infrastructure. Soon, we discovered an issue with this pattern: we were creating an Application Load Balancer for every new service we deployed.

There were not that many services, around nine, but we had also implemented lower environments for testing purposes: Development and QA, besides Production, each an exact replica with smaller capacity. Each ALB, though, has a base cost of around $19 a month just for running, plus a variable cost that depends on usage, called the LCU cost. The following table shows the total associated costs.
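For reference, the usual way to remove this redundancy is to share a single ALB across services with host-based listener rules. A minimal boto3 sketch of that pattern follows; every ARN and hostname below is a placeholder, and this is my illustration rather than the exact change from the post:

```python
# Minimal sketch of the consolidation pattern: one shared ALB, with a
# host-based listener rule per service forwarding to that service's
# target group. Every ARN and hostname below is a placeholder.
import boto3

elbv2 = boto3.client("elbv2")

SHARED_LISTENER_ARN = "arn:aws:elasticloadbalancing:region:account:listener/app/shared-alb/abc/def"

services = [
    # (rule priority, hostname, target group ARN) -- placeholders
    (10, "api.example.com", "arn:aws:elasticloadbalancing:region:account:targetgroup/api/123"),
    (20, "admin.example.com", "arn:aws:elasticloadbalancing:region:account:targetgroup/admin/456"),
]

for priority, host, target_group_arn in services:
    elbv2.create_rule(
        ListenerArn=SHARED_LISTENER_ARN,
        Priority=priority,  # must be unique within the listener
        Conditions=[{"Field": "host-header", "HostHeaderConfig": {"Values": [host]}}],
        Actions=[{"Type": "forward", "TargetGroupArn": target_group_arn}],
    )
```

Rule priorities must be unique within a listener, and the default quota of rules per ALB is generous enough for a handful of services, so one shared load balancer per environment typically covers this kind of setup.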

Read more →

Saving $23,640 by just optimizing queries

Before I started making changes to the infrastructure, the rule was simply to go one step up in server size: every time there was a spike and the database was confirmed as the root cause, the instance size was increased. At some point, we had a single 8xlarge instance for our main database.

[Figure: Database billing]

In the past, there was no logging system in place, and actions were based only on resource usage. We started getting better insight after implementing Datadog on AWS: we set alerts for resource usage, and we also enabled logs from the database, including the slow query log.
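For reference, on RDS the slow query log is enabled through the DB parameter group. A minimal boto3 sketch, assuming a MySQL-compatible engine; the parameter group name is a placeholder:

```python
# Hedged sketch: enable the MySQL slow query log on RDS by editing a DB
# parameter group. The group name is a placeholder, and these parameter
# names are MySQL-specific; these are dynamic parameters, so they can be
# applied immediately without a reboot.
import boto3

rds = boto3.client("rds")

rds.modify_db_parameter_group(
    DBParameterGroupName="main-db-params",  # placeholder
    Parameters=[
        {
            "ParameterName": "slow_query_log",
            "ParameterValue": "1",
            "ApplyMethod": "immediate",
        },
        {
            # Log anything slower than one second.
            "ParameterName": "long_query_time",
            "ParameterValue": "1",
            "ApplyMethod": "immediate",
        },
        {
            # Write the log to files so it can be exported and tailed.
            "ParameterName": "log_output",
            "ParameterValue": "FILE",
            "ApplyMethod": "immediate",
        },
    ],
)
```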

Read more →