Downgraded Performance
Incident Report for Mews
Postmortem

On Saturday 2020-10-17 at 9:14 UTC, an issue was raised as Mews became unavailable for some properties and very slow for others. The team started investigating immediately, but did not initially find any obvious cause for the issues the properties were experiencing. The only sign was high CPU usage across web server instances; other monitoring tools did not show any significant anomalies. The system remained mostly operational, with intermittent peaks and slowdowns, until about 17:00 UTC, when performance was restored.

However, similar performance issues reappeared on Tuesday morning and on other days, albeit causing only slowdowns rather than a complete outage. We would like to apologize for this performance degradation, and we are committed to taking major steps to improve the stability and robustness of the system, especially in the area of the open API and integration partner management. Therefore, our infrastructure and connectivity teams will set aside their feature roadmap plans until the end of this year and focus on resolving all the causes of this problem.

Action

As there was no obvious cause, our platform team increased the number of web server instances to mitigate the issue. We then identified several patterns that together produced this negative impact on our services: one partner was resynchronizing a large amount of data, another was using the API in a very inefficient way, and one of our operations (rates/updatePrice) was not robust enough to handle a larger than normal number of requests.

We temporarily disabled integrations for the partner with inefficient API usage, which greatly improved the responsiveness of our system. Later we noticed that some data in our database was duplicated, so we cleaned it up and started fixing the root cause of the duplication. We also found a bug in rates/updatePrice that negatively affected the performance of this operation.

Causes
  1. Running both API and web controllers on the same instances means that excessive API traffic can negatively impact the end user experience.
  2. Multiple parallel requests were not properly handled by Mews, resulting in duplicated entries in the database. This slowed down subsequent requests.
  3. Poor performance of the rates/updatePrice operation, combined with an unrestricted number of updates allowed in one request, resulted in excessively long processing times. This led to a lot of useless data being created in the database as a result of a concurrency issue on long-running requests.
  4. Missing timeouts on long-running requests (which had probably been discarded on the partner side anyway) wasted our processing resources.
  5. Inefficient throttling. It currently monitors only the number of requests per integration and cannot detect unusually high traffic from one partner across multiple properties (a full resynchronization).
  6. We do not have a strict process for dealing with partners that are using our API incorrectly. They are notified, but a case of bad usage can fizzle out without any further action from either party.
  7. Performance monitoring of operations is not precise and granular enough.
  8. We do not support webhooks for the most common data changes in Mews, so partners have to periodically poll for updated data.
  9. Finding the root cause was slowed down by the lack of a clear distinction between API-created entities and system-created entities.
Solutions
  1. We will segregate application instances into two groups: one will handle our open APIs, the other will serve traffic of our products.
  2. Extend the usage of concurrency tokens to prevent duplication in the case of complex updates (see the concurrency token sketch after this list).
  3. Improve the performance of the rates/updatePrice operation by reducing CPU overhead and removing unnecessary instructions. Restrict the number of updates allowed in one request to this operation to 1000 (see the batching sketch after this list).
  4. Implement proper timeout handling for all transactions in the system.
  5. Implement a smarter throttling solution that will monitor traffic at the partner, integration and endpoint levels (see the throttling sketch after this list).
  6. Set up a strict process which ensures that cases of bad API usage are followed up on.
  7. Implement anomaly detection in Azure based on request duration, with granularity at the partner, integration and endpoint levels.
  8. Introduce webhooks so partners do not need to poll for updated data (see the webhook sketch after this list).
  9. Extend the use of profiles to distinguish who (integration, internal job, ...) created which entity.
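
For illustration, a minimal sketch of the concurrency token idea in Python, using an optimistic version check: the caller reads a row together with its version and writes it back only if the version has not changed in the meantime. The table, column and function names are illustrative assumptions, not Mews internals.

import sqlite3

class UpdateConflict(Exception):
    """Raised when another request modified the same row in the meantime."""

def update_rate_price(conn, rate_id, new_price, expected_version):
    # The UPDATE succeeds only if the stored version still matches the token
    # the caller read earlier; a concurrent writer bumps the version, so the
    # losing request fails loudly instead of silently duplicating data.
    cur = conn.execute(
        "UPDATE rate_prices SET price = ?, version = version + 1"
        " WHERE id = ? AND version = ?",
        (new_price, rate_id, expected_version),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise UpdateConflict(f"rate price {rate_id} was updated concurrently")

# Usage: read the row (and its version) first, then write it back with that
# version as the concurrency token.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rate_prices (id TEXT PRIMARY KEY, price REAL, version INTEGER)")
conn.execute("INSERT INTO rate_prices VALUES ('rate-1', 100.0, 1)")
update_rate_price(conn, "rate-1", 120.0, expected_version=1)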
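The 1000-update restriction means partners sending large price changes will need to split them into batches. A hedged client-side batching sketch in Python using the requests library; the endpoint URL and payload shape are simplified assumptions rather than the exact Connector API contract, and the 30-second timeout also illustrates how partners can avoid long-hanging requests.

import requests

API_URL = "https://api.mews-demo.com/api/connector/v1/rates/updatePrice"  # illustrative URL
MAX_UPDATES_PER_REQUEST = 1000  # the new limit described above

def update_prices(client_token, access_token, rate_id, price_updates):
    # Send at most 1000 updates per request and give up after 30 seconds
    # instead of leaving a long-running request hanging on the server.
    for start in range(0, len(price_updates), MAX_UPDATES_PER_REQUEST):
        batch = price_updates[start:start + MAX_UPDATES_PER_REQUEST]
        response = requests.post(
            API_URL,
            json={
                "ClientToken": client_token,
                "AccessToken": access_token,
                "RateId": rate_id,
                "PriceUpdates": batch,
            },
            timeout=30,
        )
        response.raise_for_status()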
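A minimal throttling sketch that counts requests per partner, per integration and per endpoint over a sliding window, so that a full resynchronization by one partner is caught even when each individual integration stays under its own limit. The limits and key structure are assumptions for illustration only.

import time
from collections import defaultdict, deque

class MultiLevelThrottle:
    def __init__(self, limits):
        # limits: level -> (max requests, window in seconds), e.g.
        # {"partner": (5000, 60), "integration": (1000, 60), "endpoint": (250, 60)}
        self.limits = limits
        self.history = defaultdict(deque)  # key -> recent request timestamps

    def allow(self, partner_id, integration_id, endpoint):
        now = time.monotonic()
        keys = {
            "partner": ("partner", partner_id),
            "integration": ("integration", integration_id),
            "endpoint": ("endpoint", integration_id, endpoint),
        }
        # Reject the request if any level is over its limit.
        for level, key in keys.items():
            max_requests, window = self.limits[level]
            timestamps = self.history[key]
            while timestamps and now - timestamps[0] > window:
                timestamps.popleft()
            if len(timestamps) >= max_requests:
                return False  # caller responds with HTTP 429
        for key in keys.values():
            self.history[key].append(now)
        return True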
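Finally, a webhook sketch from the partner's side: instead of polling on a schedule, the partner exposes an endpoint and reacts to changes as they are pushed. The event name and payload fields are illustrative assumptions, not a published Mews webhook contract.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/mews/webhook", methods=["POST"])
def handle_webhook():
    event = request.get_json(force=True)
    # Process only the entity that changed instead of re-downloading
    # everything periodically.
    if event.get("Type") == "PriceUpdated":
        store_price_change(event.get("Data", {}))
    return jsonify({"ok": True})

def store_price_change(data):
    # Placeholder for the partner's own persistence logic.
    print("price changed:", data)

if __name__ == "__main__":
    app.run(port=8080)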
Posted Nov 26, 2020 - 13:12 CET

Resolved
The incident has been resolved and the system has been operational for a longer period of time. We are working on a post-mortem and further improvements to prevent this from happening in the future.
Posted Oct 17, 2020 - 19:46 CEST
Monitoring
A fix has been implemented and we are monitoring the results. No issues have been present for the last 30 minutes.
Posted Oct 17, 2020 - 16:31 CEST
Update
The system is overall operational. However, we are still experiencing small intermittent peaks of performance degradation and are continuing to investigate.
Posted Oct 17, 2020 - 14:12 CEST
Investigating
We are experiencing intermittent problems again and investigating the root cause.
Thank you for your patience.
Posted Oct 17, 2020 - 13:13 CEST
Monitoring
We have isolated the cause of the issue and resolved it. We will continue monitoring.
Posted Oct 17, 2020 - 12:23 CEST
Investigating
We have noticed downgraded performance of the system.
We are currently looking into it.
Thank you for your patience.
Posted Oct 17, 2020 - 11:34 CEST
This incident affected: Operations, Guest Journey, Payments, and Open API.