On Saturday 2020-10-17 at 9:14 UTC, an issue was raised as Mews became unavailable for some properties and very slow for others. The team started investigating immediately, but at first found no obvious cause for the issues the properties were experiencing. The only sign was high CPU usage across web server instances; other monitoring tools did not show any significant anomalies. The system was mostly operational, with intermittent peaks and slowdowns, until about 17:00 UTC, when performance was restored.
However, similar performance issues reappeared on Tuesday morning and on other days, albeit causing only slowdowns and not a complete outage. We’d like to apologize for this performance degradation. We are committing to take major steps to improve the stability and robustness of the system, especially in the area of the open API and integration partner management. Therefore, our infrastructure and connectivity teams will drop their feature roadmap plans until the end of this year and focus on resolving all the causes of this problem.
As there was no obvious cause, our platform team increased the number of web server instances to mitigate the issue. We then identified several patterns that together caused this negative impact on our services: one partner was resynchronizing a large amount of data, another was using the API in a very inefficient way, and finally one of our operations (rates/updatePrice) was not robust enough to handle a larger than normal number of requests.
We temporarily disabled integrations for the partner with inefficient API usage, which greatly improved the responsiveness of our system. Later we noticed that some data in our database was duplicated, so we cleaned it up and started fixing the root cause of the duplication. We also noticed a bug in rates/updatePrice which negatively affected the performance of this operation.
The rates/updatePrice operation, combined with an unrestricted number of updates allowed in one request, resulted in excessively long processing times. This led to a lot of useless data being created in the database as a result of a concurrency issue in long-running requests.
Optimize the rates/updatePrice operation by reducing CPU usage overhead and removing unnecessary instructions. Restrict the number of updates in one request for this operation to 1000.
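
A per-request limit like the one described above is typically checked before any processing starts, so an oversized request fails fast instead of turning into a long-running one. The following is only a minimal sketch under stated assumptions, not the actual Mews implementation: the names `MAX_UPDATES_PER_REQUEST`, `TooManyUpdatesError`, `validate_update_count` and `split_into_batches` are hypothetical, and only the limit of 1000 comes from this report.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

# Limit taken from the action item above; the constant name is hypothetical.
MAX_UPDATES_PER_REQUEST = 1000


class TooManyUpdatesError(ValueError):
    """Raised when a single request carries more updates than allowed."""


def validate_update_count(updates: Sequence[T]) -> None:
    """Reject an oversized request up front, before any processing starts."""
    if len(updates) > MAX_UPDATES_PER_REQUEST:
        raise TooManyUpdatesError(
            f"{len(updates)} updates in one request; "
            f"the limit is {MAX_UPDATES_PER_REQUEST}"
        )


def split_into_batches(
    updates: Sequence[T], size: int = MAX_UPDATES_PER_REQUEST
) -> Iterator[List[T]]:
    """Client-side helper: split a large update set into requests within the limit."""
    for start in range(0, len(updates), size):
        yield list(updates[start:start + size])
```

On the integration side, a partner sending a large set of price updates could use a helper like `split_into_batches` to stay within the limit, sending each batch as its own request rather than one very large one.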