System Unavailable
Incident Report for Mews
Postmortem

First of all, we would like to apologize for the unresponsiveness of Mews on the night of the 24th of October 2018 (21:40 to 22:05 UTC). We take such incidents seriously and work hard to ensure that this and similar problems do not occur in the future.

Cause

As we continuously update our system to deliver new features, we update the database model accordingly during deployments. The last deployment introduced changes that consumed an unexpectedly large amount of database resources, and after a while the operation timed out. However, one of the application instances kept retrying the process because of a built-in recovery procedure. This kept the database unresponsive for a longer period.

Action

We have reverted the change and deployed our application without it.

Solution

To properly resolve the issue and prevent anything similar from recurring, we have introduced several improvements to our processes:

  • We have banned all potentially non-performant database changes and now require more efficient means to be used instead.
  • We have improved our monitoring of critical errors.
  • Application instances will no longer retry the migration process after a fatal failure.
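
As an illustration of the last point, the sketch below shows one way a recovery procedure can retry only transient failures and give up immediately on fatal ones. It is a minimal example under stated assumptions, not the actual Mews implementation: the exception classes, the `run_migration` function, and its parameters are all hypothetical names introduced here.

```python
import time
from typing import Callable


class TransientMigrationError(Exception):
    """Hypothetical: a failure worth retrying, e.g. a brief connection drop."""


class FatalMigrationError(Exception):
    """Hypothetical: a failure retrying cannot fix, e.g. a timed-out migration."""


def run_migration(
    migrate: Callable[[], None],
    max_attempts: int = 3,
    base_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `migrate`, retrying transient failures only.

    Returns the number of attempts used. Fatal errors propagate on the
    first occurrence, so a heavy migration is never re-run automatically.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            migrate()
            return attempt
        except FatalMigrationError:
            # Do not retry: repeating the operation would only add more load.
            raise
        except TransientMigrationError:
            if attempt == max_attempts:
                raise
            # Exponential backoff between attempts: 1x, 2x, 4x, ...
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

In this sketch the number of attempts is bounded and the delay grows between attempts, so even a transient failure cannot turn into an endless retry loop against the database.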
Posted Oct 29, 2018 - 17:42 CET

Resolved
This incident has been resolved.
Posted Oct 25, 2018 - 00:35 CEST
Monitoring
We've resolved the issue and continue to monitor system health.
Posted Oct 25, 2018 - 00:10 CEST
Update
We are continuing to investigate this issue.
Posted Oct 24, 2018 - 23:50 CEST
Investigating
We are currently investigating the issue and performing configuration changes that should resolve the problem.
Posted Oct 24, 2018 - 23:50 CEST
This incident affected: Operations, Guest Journey, and Open API.