Saving $23,640 by just optimizing queries

Before I started making changes to the infrastructure, the rule was simply to go one step up in server size: every time there was a spike and the database was confirmed to be the root cause, the instance size was increased, and at some point we ended up with a single 8xlarge instance for our main database.
In the past there was no logging system in place, and decisions were based only on resource usage. We started getting better insight after implementing Datadog on AWS: we set alerts for resource usage and also enabled logs from the database, including the slow query log.
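As a rough illustration, here is a minimal sketch of turning the slow query log on, assuming a MySQL instance and the pymysql driver. On RDS these settings normally live in a parameter group rather than being set at runtime, and the host and credentials below are placeholders.

```python
# Minimal sketch: enabling MySQL's slow query log from a script.
# Assumes a self-managed MySQL instance and the pymysql driver;
# on RDS these settings are usually applied through a parameter group instead.
import pymysql

conn = pymysql.connect(
    host="db.example.internal",  # hypothetical host
    user="admin",
    password="secret",
)

with conn.cursor() as cur:
    cur.execute("SET GLOBAL slow_query_log = 'ON'")
    cur.execute("SET GLOBAL long_query_time = 1")   # log anything slower than 1 second
    cur.execute("SET GLOBAL log_output = 'FILE'")   # write entries to the slow log file
    cur.execute("SHOW VARIABLES LIKE 'slow_query_log%'")
    for name, value in cur.fetchall():
        print(name, value)

conn.close()
```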
At the beginning we were checking logs in CloudWatch whenever an alert was raised, which made it a bit slow to figure out what was actually happening. I'm glad that at some point I introduced Grafana as an alternative for real-time reports, so we could measure the health of our website. A few days into the implementation I discovered that Grafana has a community that shares its dashboards, and by simply copying their IDs you can set up the same dashboards yourself.
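Importing one of those community dashboards is normally a copy-paste job in the UI (Dashboards → Import → paste the ID), but it can also be scripted. The sketch below is illustrative only: the grafana.com download URL, the `/api/dashboards/db` endpoint, the token and the dashboard ID are all assumptions to verify against your Grafana version.

```python
# Minimal sketch: importing a community dashboard into a local Grafana by its ID.
import requests

DASHBOARD_ID = 1234                                   # hypothetical community dashboard ID
GRAFANA_URL = "http://grafana.example.internal:3000"  # placeholder Grafana instance
API_KEY = "eyJrIjoi..."                               # placeholder API token

# Fetch the latest published revision of the community dashboard from grafana.com.
dashboard = requests.get(
    f"https://grafana.com/api/dashboards/{DASHBOARD_ID}/revisions/latest/download",
    timeout=30,
).json()
dashboard["id"] = None  # clear the upstream numeric id so Grafana treats it as new

# Push it to the local Grafana instance.
resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"dashboard": dashboard, "overwrite": True},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```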
I used this cool one for CloudWatch logs.
The dashboard lets you filter by log group, and by default I set it to show the slow query logs. With this I was able to quickly find the offending queries and also where they came from, since the report gives you all the information related to each query: the user, the host, the execution time and, of course, the full query text.
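For context, those fields are exactly what a MySQL slow query log entry carries. Here is a minimal, illustrative sketch of pulling them out of a raw entry; the sample entry and regexes are simplified, but the field names follow the standard slow log file format.

```python
# Minimal sketch: extracting user, host, execution time and query text
# from a MySQL slow query log entry (standard file format).
import re

SAMPLE_ENTRY = """\
# Time: 2023-05-04T12:34:56.789Z
# User@Host: app_user[app_user] @  [10.0.1.23]
# Query_time: 4.512340  Lock_time: 0.000120 Rows_sent: 1  Rows_examined: 1048576
SET timestamp=1683203696;
SELECT * FROM orders WHERE customer_email = 'someone@example.com';
"""

def parse_slow_log_entry(entry: str) -> dict:
    user_host = re.search(r"# User@Host:\s*(\S+)\s*@\s*\[?([\w.\-]*)\]?", entry)
    query_time = re.search(r"# Query_time:\s*([\d.]+)", entry)
    rows_examined = re.search(r"Rows_examined:\s*(\d+)", entry)
    # Everything that is not a comment or the SET timestamp line is the query itself.
    query_lines = [
        line for line in entry.splitlines()
        if not line.startswith("#") and not line.startswith("SET timestamp")
    ]
    return {
        "user": user_host.group(1) if user_host else None,
        "host": user_host.group(2) if user_host else None,
        "query_time_s": float(query_time.group(1)) if query_time else None,
        "rows_examined": int(rows_examined.group(1)) if rows_examined else None,
        "query": " ".join(query_lines).strip(),
    }

print(parse_slow_log_entry(SAMPLE_ENTRY))
```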
When a slow query is detected the process is simple: find the portion of code that is executing it and try to optimize the query, either by changing its structure or by adding indexes; sometimes it is also necessary to change the logic behind the query to improve it.
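As a concrete (hypothetical) example of that loop: run EXPLAIN on the offending statement, and if it shows a full table scan on a column the code filters by, add an index and confirm the plan changes. Table, column and connection details below are placeholders, with pymysql again assumed as the driver.

```python
# Minimal sketch of the optimization loop: EXPLAIN the slow statement,
# add an index on the filtered column, then EXPLAIN again.
import pymysql

conn = pymysql.connect(host="db.example.internal", user="admin",
                       password="secret", database="shop")

SLOW_QUERY = "SELECT * FROM orders WHERE customer_email = %s"

with conn.cursor() as cur:
    # Step 1: check the execution plan. An access type of 'ALL' (full table scan)
    # with a large row estimate usually points to a missing index.
    cur.execute("EXPLAIN " + SLOW_QUERY, ("someone@example.com",))
    for row in cur.fetchall():
        print(row)

    # Step 2: add an index on the column used in the WHERE clause,
    # then re-run EXPLAIN to confirm the plan now uses it.
    cur.execute("CREATE INDEX idx_orders_customer_email ON orders (customer_email)")
    cur.execute("EXPLAIN " + SLOW_QUERY, ("someone@example.com",))
    for row in cur.fetchall():
        print(row)

conn.close()
```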
This is everyday work: as soon as we detect a query that impacts performance we immediately attempt to fix it. As a consequence, our main database is now a much smaller instance and, as you can see in the following graphic, most of the time it sits under 10% usage, which would allow us to downsize it even more.
We also implemented read replicas that handle all the SELECT queries and reporting, offloading all of that from the writer instance, and we can probably downsize them as well, since their CPU usage is always under 25%.
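The application-side split is simple in principle: writes and transactional work go to the writer endpoint, heavy SELECTs and reporting go to the replica endpoint. The sketch below is a minimal illustration of that idea; the endpoints, credentials and table are placeholders, not our actual setup.

```python
# Minimal sketch of a read/write split: writes hit the writer endpoint,
# SELECTs and reporting hit a read replica.
import pymysql

WRITER_HOST = "main-db.cluster-xxxx.us-east-1.rds.amazonaws.com"      # hypothetical
READER_HOST = "main-db.cluster-ro-xxxx.us-east-1.rds.amazonaws.com"   # hypothetical

def connect(host: str) -> pymysql.connections.Connection:
    return pymysql.connect(host=host, user="app", password="secret", database="shop")

writer = connect(WRITER_HOST)
reader = connect(READER_HOST)

# Writes, and anything inside a transaction, go to the writer.
with writer.cursor() as cur:
    cur.execute("UPDATE orders SET status = %s WHERE id = %s", ("shipped", 42))
writer.commit()

# Heavy SELECTs and reporting queries go to the replica,
# keeping that load off the writer instance.
with reader.cursor() as cur:
    cur.execute("SELECT status, COUNT(*) FROM orders GROUP BY status")
    for status, count in cur.fetchall():
        print(status, count)
```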
With these tools in place, measuring the health of our databases is much easier, and at the same time we have been able to improve speed, resiliency and response time.