Before I started making changes to the infrastructure, the rule was simply to go one step up in server size. Every time there was a spike and the database was confirmed as the root cause, the server size was increased, and at some point we had a single 8xlarge instance for our main database.
In the past there was no logging system in place, and actions were based only on resource usage. We started getting better insight after implementing Datadog in AWS: we set alerts on resource usage and also enabled the database logs, including the slow query log.