Shortly after a new version of the application was rolled out at 10:26 AM UTC, system performance degraded due to high utilization of the SQL database.
Because the rollout included a couple of new background jobs, these were soon suspected to be the culprits, and at 10:47 AM UTC we stopped processing background jobs. This brought database utilization down, and the application started serving web traffic reliably again.
Subsequent investigation of the database performance history revealed a single SQL query dominating resource utilization since the rollout. We identified the source of the query in the application code and deployed a hotfix removing the responsible code. This allowed us to re-enable background job processing at 10:57 AM UTC.
2. It took 7 minutes to detect the issue, which is not ideal.
1.a) We will fix the performance of the offending query.
1.b) We will also investigate the possibility of detecting this kind of database issue proactively.
2. We will identify monitoring conditions that would page us immediately when triggered.
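As a starting point for items 1.b and 2, one candidate monitoring condition is to flag any single query that dominates total database execution time, as happened in this incident. Below is a minimal sketch in Python; it assumes we can periodically sample per-query cumulative execution times (for example from the database's statistics views). The function name, sampling shape, and the 50% threshold are illustrative assumptions, not part of our existing tooling:

```python
def dominant_queries(query_times, threshold=0.5):
    """Return queries whose share of total execution time meets the threshold.

    query_times: mapping of query identifier -> cumulative execution time
                 sampled over the alerting window (units do not matter,
                 only the ratio does).
    threshold:   fraction of total time above which a query is flagged;
                 0.5 here is an illustrative default to be tuned.
    """
    total = sum(query_times.values())
    if total == 0:
        return []  # no load in the window, nothing to flag
    return [q for q, t in query_times.items() if t / total >= threshold]


# Example: a query consuming 90% of database time in the window is flagged
# and could trigger an immediate page.
sample = {
    "SELECT ... FROM jobs": 90.0,
    "SELECT ... FROM users": 8.0,
    "INSERT ...": 2.0,
}
print(dominant_queries(sample))  # → ['SELECT ... FROM jobs']
```

A ratio-based condition like this would have fired shortly after the rollout regardless of absolute traffic levels; the threshold and sampling window would need tuning against normal workload variance to avoid false pages.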