Most of the requests sent on 27 August 2019 between 08:00 and 09:00 UTC either took a long time or failed outright, and our background jobs were stopped entirely.
We immediately started investigating the degraded performance and found an issue with our database, which is hosted on Azure, so we contacted Azure support. There was nothing unusual in any of our monitoring systems, so we preventively stopped the background jobs, but saw no improvement. Before Azure support got back to us, we performed a failover to another replica within the same region, which also did not help. A few minutes after the system returned to normal, Azure support informed us that a failure on their side (hardware or software) had triggered configuration changes as part of an unplanned operation. After the web application had been running for a few minutes without problems, we turned the jobs back on.
The major cause was on Azure's side, and this is the root cause they shared with us:
We have recently introduced an optimization to throttle the log generation rate on the geo primary in the case when a customer has a geo secondary with a lower SLO than the geo primary and the geo secondary is unable to keep up with the workload on the geo primary. This prevents the log on the geo primary from overflowing (since we cannot truncate the log before the geo secondary commits the changes in its own log), as well as significant data loss in case the customer needs to perform a forced failover. More information can be found in this announcement: https://azure.microsoft.com/en-in/updates/new-active-geo-replication-optimization-is-coming-to-production-soon/
Unfortunately, this optimization had a bug causing latch contention, which can result in a SQL Server process crash. This caused an unplanned failover and unavailability for the Mews database.
Apart from that, our infrastructure at that point was set up in a way that did not allow us to perform a failover to another database in less than two hours.
We have since improved our infrastructure and processes so that a failover can be performed almost instantly, and we organized a periodic test of this process on our demo environment. Apart from that, we improved resiliency even within a single region: our database is now spread across more locations and so has fewer points of failure.
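As an illustration of what such a failover involves, an Azure SQL geo-replica can be promoted to primary from the secondary server with the Azure CLI. This is a hedged sketch, not our exact runbook; the server, database, and resource-group names below are hypothetical:

```shell
# List the replication links for the database to confirm
# the secondary replica exists and is catching up.
az sql db replica list-links \
  --name MewsDb \
  --server secondary-server \
  --resource-group mews-rg

# Promote the secondary replica to primary. A planned failover
# like this waits for full synchronization, so no data is lost.
az sql db replica set-primary \
  --name MewsDb \
  --server secondary-server \
  --resource-group mews-rg
```

A forced failover (adding `--allow-data-loss` to `set-primary`) skips synchronization and is the last-resort scenario the Azure announcement above refers to.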