System Unavailable

Incident Report for Mews

Postmortem

First of all, we would like to apologize for the unresponsiveness of Mews on the night of 24 October 2018 (21:40 to 22:05 UTC). We take such incidents seriously and work hard to ensure that this and similar problems do not recur.

Cause

As we continuously update our system to deliver new features, we update the database model accordingly during deployments. The last deployment introduced changes that consumed an unexpected amount of database resources, and after a while the operation timed out. However, one of the application instances kept retrying the process because of a built-in recovery procedure, which kept the database unresponsive for a longer period.

Action

We have reverted the change and deployed our application without it.

Solution

To properly resolve the issue and prevent anything similar from recurring, we have introduced several improvements to our processes:

We have banned all potentially non-performant database changes and now enforce the use of more efficient means.
We have improved our monitoring system for crucial errors.
Application instances will no longer retry the migration process in case of fatal failures.
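
The last point can be illustrated with a minimal Python sketch. This is not the actual Mews code: the exception classes, the run_migration function, and the max_attempts parameter are hypothetical names chosen for illustration, and the sketch assumes that failures can be classified as either fatal (such as a timed-out migration) or transient. Only transient failures are retried, and only a bounded number of times.

```python
class FatalMigrationError(Exception):
    """Hypothetical marker for failures that must not be retried (e.g. a timeout)."""


class TransientMigrationError(Exception):
    """Hypothetical marker for failures that are assumed safe to retry."""


def run_migration(migrate, max_attempts=3):
    """Call `migrate`, retrying transient failures at most `max_attempts` times.

    Returns the number of attempts used on success. A fatal failure is
    re-raised immediately so the instance stops instead of loading the
    database again with the same expensive operation.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(1, max_attempts + 1):
        try:
            migrate()
            return attempt
        except FatalMigrationError:
            # Fatal: never retry, let the caller and monitoring see it.
            raise
        except TransientMigrationError:
            # Transient: retry unless the attempt budget is exhausted.
            if attempt == max_attempts:
                raise
```

With this shape, a migration that raises FatalMigrationError is attempted exactly once per instance, whereas an unconditional recovery loop would keep re-running it.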

Posted Oct 29, 2018 - 17:42 CET

Resolved

This incident has been resolved.

Posted Oct 25, 2018 - 00:35 CEST

Monitoring

We've resolved the issue and continue to monitor system health.

Posted Oct 25, 2018 - 00:10 CEST

Update

We are continuing to investigate this issue.

Posted Oct 24, 2018 - 23:50 CEST

Investigating

We are currently investigating the issue and performing configuration changes that should resolve the problem.

Posted Oct 24, 2018 - 23:50 CEST

This incident affected: Operations, Guest Journey, and Open API.