Shortly after a new version of the application was rolled out at 10:26 AM UTC, system performance degraded due to high utilization of the SQL database.
Because the rollout included a couple of new background jobs, these were soon suspected to be the culprits, and at 10:47 AM UTC we stopped processing background jobs. This brought database utilization down, and the application started serving web traffic reliably again.
Subsequent investigation of the database performance history revealed a single SQL query dominating resource utilization since the rollout. We identified the source of the query in the application code and deployed a hotfix removing the responsible code. This allowed us to re-enable background job processing at 10:57 AM UTC.
2. It took 7 minutes to detect the issue, which is not ideal.
1.a) We will fix the performance of the offending query.
1.b) We will also investigate the possibility of detecting this kind of database issue proactively.
2. We will identify monitoring conditions that would page us immediately when triggered.
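As a starting point for items 1.b and 2, one candidate monitoring condition is to flag any single query that dominates total database execution time, as happened in this incident. Below is a minimal sketch in Python; it assumes we can periodically sample per-query cumulative execution times (for example from the database's statistics views). The function name, sampling shape, and the 50% threshold are illustrative assumptions, not part of our existing tooling:

```python
def dominant_queries(query_times, threshold=0.5):
    """Return queries whose share of total execution time meets the threshold.

    query_times: mapping of query identifier -> cumulative execution time
                 sampled over the alerting window (units do not matter,
                 only the ratio does).
    threshold:   fraction of total time above which a query is flagged;
                 0.5 here is an illustrative default to be tuned.
    """
    total = sum(query_times.values())
    if total == 0:
        return []  # no load in the window, nothing to flag
    return [q for q, t in query_times.items() if t / total >= threshold]


# Example: a query consuming 90% of database time in the window is flagged
# and could trigger an immediate page.
sample = {
    "SELECT ... FROM jobs": 90.0,
    "SELECT ... FROM users": 8.0,
    "INSERT ...": 2.0,
}
print(dominant_queries(sample))  # → ['SELECT ... FROM jobs']
```

A ratio-based condition like this would have fired shortly after the rollout regardless of absolute traffic levels; the threshold and sampling window would need tuning against normal workload variance to avoid false pages.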