Most of the requests sent on 27 August 2019 between 08:00 and 09:00 UTC either took a long time or failed outright, and our background jobs were stopped entirely.
We immediately started investigating the degraded performance and found an issue with our database, which is hosted on Azure, so we contacted Azure support. There was nothing unusual in any of our monitoring systems, so we preventively stopped the background jobs, but saw no improvement. Before Azure support got back to us, we performed a failover to another replica within the same region, which also did not help. A few minutes after the system returned to normal, Azure support informed us that a failure on their side (hardware or software) had triggered configuration changes as part of an unplanned operation. After the web application had been running for a few minutes without problems, we turned the jobs back on.
The major cause was on Azure's side, and this is the root cause they shared with us:
We have recently introduced an optimization to throttle the log generation rate on the geo primary in the case when a customer has a geo secondary with a lower SLO than the geo primary and the geo secondary is unable to keep up with the workload on the geo primary. This prevents the log on the geo primary from overflowing (since we cannot truncate the log before the geo secondary commits the changes in its own log), as well as significant data loss in case the customer needs to perform a forced failover. More information can be found in this announcement: https://azure.microsoft.com/en-in/updates/new-active-geo-replication-optimization-is-coming-to-production-soon/
Unfortunately, this optimization had a bug causing latch contention, which can result in a SQL Server process crash. This caused an unplanned failover and unavailability for the Mews database.
Apart from that, our infrastructure at that point was set up in a way that did not allow us to perform a failover to another database in less than two hours.
We have since improved our infrastructure and processes so that a failover can be performed almost instantly, and we organized a periodic test of this process on our demo environment. Apart from that, we improved resiliency even within a single region: our database is now spread across more locations and so has fewer points of failure.
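As an illustration of what such a failover involves, an Azure SQL geo-replica can be promoted to primary from the secondary server with the Azure CLI. This is a hedged sketch, not our exact runbook; the server, database, and resource-group names below are hypothetical:

```shell
# List the replication links for the database to confirm
# the secondary replica exists and is catching up.
az sql db replica list-links \
  --name MewsDb \
  --server secondary-server \
  --resource-group mews-rg

# Promote the secondary replica to primary. A planned failover
# like this waits for full synchronization, so no data is lost.
az sql db replica set-primary \
  --name MewsDb \
  --server secondary-server \
  --resource-group mews-rg
```

A forced failover (adding `--allow-data-loss` to `set-primary`) skips synchronization and is the last-resort scenario the Azure announcement above refers to.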