Created
February 18, 2018 20:16
| # Scaling plan for Riide Backend | |
| Last updated 16 Feb 2018 | |
| ### Scaling bottlenecks | |
| - MySQL is growing rapidly: 52 GB as of Feb 2018, growing 2-3 GB a week at the moment. This is not an urgent issue, but we need to handle it before it grows above 600 GB. | |
| The earlier we solve this, the faster the migration will be (since there is less data to migrate). | |
| - It is possible to just buy a bigger server, but restoring backups / scaling horizontally will end up taking days | |
| - MySQL is a single RDS server that handles all traffic. Horizontal scaling should be looked into | |
| - Queues: jobs are too big or take too long. | |
| - user_update jobs have a 1:N issue: we are synchronizing all users with each company and pulling iCabbi bookings from each company. | |
| - Throwing more server power at this could solve it. | |
| - The fallback system loops over every active booking every 5 min (configurable): a 1:N issue again | |
| - Throwing more server power at this could solve it. | |
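To put the 600 GB limit on a timeline, the figures above can be turned into a rough projection. This is a minimal sketch that assumes growth stays linear at 2-3 GB a week, which is an assumption; if usage keeps increasing the limit will be reached sooner. The function name is made up for illustration.

```python
from datetime import date, timedelta

# Figures taken from this plan.
CURRENT_GB = 52
LIMIT_GB = 600
MEASURED_ON = date(2018, 2, 16)  # "last updated" date of this plan


def weeks_until_limit(current_gb: float, limit_gb: float, growth_gb_per_week: float) -> float:
    """Weeks of linear growth left before the database reaches limit_gb."""
    return (limit_gb - current_gb) / growth_gb_per_week


# Assumption: growth stays constant at the low (2) or high (3) end of the range.
for growth in (2, 3):
    weeks = weeks_until_limit(CURRENT_GB, LIMIT_GB, growth)
    reached = MEASURED_ON + timedelta(weeks=weeks)
    print(f"{growth} GB/week: about {weeks:.0f} weeks, reached around {reached:%B %Y}")
```

Rerunning this with the real weekly growth from RDS metrics would show whether the rate itself is rising.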
| ### Other issues | |
| - No application monitoring: we have to see the issues and replicate them before we can work on a fix (unless it is a crash, which Bugsnag reports) | |
| - Peak periods are always on weekends, when we are not working. | |
| - The iCabbi integration is not a provider anymore; we have coupled the systems very tightly | |
| - New companies are added and set up with errors, and alerts are not getting resolved | |
| - There is an overlap between webhooks and our fallback system, which is the cause of several bugs | |
| - Webhook delay is not always working | |
| - Double charges (some cases) | |
| - Drivers getting payment success and then failed | |
| ### Suggestions | |
| - Install New Relic APM: the best way to monitor how the app is used and where to spend time on optimization (API calls) | |
| - Set up Elasticsearch | |
| - Move all historical / non-mission-critical data (booking updates, payment updates, webhook requests, booking locations etc.) out of MySQL to Elasticsearch | |
| - Build / set up a queue / command monitoring tool. We need an overview of what is running when, and whether jobs finish in time. Afterwards, optimize them | |
| - MySQL optimize | |
| - Look into the booking object, e.g. whether locations / directions can be removed or moved to either Elasticsearch or files | |
| - Look into booking indexes (2.1GB) | |
| - Order without index (While there is nothing wrong with a high amount of row sorting, you might want to make sure that the queries which require a lot of sorting use indexed columns in the ORDER BY clause, as this will result in much faster sorting.) | |
| - Joins without index (This means that joins are doing full table scans. Adding indexes for the columns being used in the join conditions will greatly speed up table joins.) | |
| - Optimize tables (function) | |
| - Run through config update suggestions | |
| - Improve clean up scripts | |
| - Set up read / write system for MySQL, for horizontal scaling | |
| - Consider Aurora DB (Amazon's new MySQL-compatible database; MySQL 5.7 support since 15 Feb) | |
| - Improve the dashboard | |
| - Improve the alert system | |
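As a rough idea of what the queue / command monitoring tool could record, here is a minimal sketch: a wrapper that logs when a job started, how long it took, and whether it exceeded its time budget. The names (`monitored`, `JobRun`, `RUNS`) are invented for illustration, and the in-memory list stands in for whatever storage the real tool would use.

```python
import time
from dataclasses import dataclass
from functools import wraps
from typing import Callable, List


@dataclass
class JobRun:
    name: str
    started_at: float  # Unix timestamp
    duration_s: float
    overran: bool      # True if the job took longer than its budget


RUNS: List[JobRun] = []  # hypothetical sink; could be a table or a metrics service


def monitored(name: str, budget_s: float) -> Callable:
    """Record start time, duration and budget overruns for the wrapped job."""
    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args, **kwargs):
            started = time.time()
            t0 = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs even if the job raises, so failed jobs are recorded too.
                duration = time.perf_counter() - t0
                RUNS.append(JobRun(name, started, duration, duration > budget_s))
        return wrapper
    return decorator


# Budget of 300 s: the fallback loop should finish before the next 5 min run.
@monitored("fallback_sweep", budget_s=300)
def fallback_sweep(active_bookings: list) -> int:
    return len(active_bookings)  # placeholder for the real per-booking work


fallback_sweep(["b1", "b2", "b3"])
for run in RUNS:
    print(f"{run.name}: {run.duration_s:.4f}s, overran={run.overran}")
```

Setting each job's budget to its scheduling interval makes overlapping runs visible, which is the "finish in time" question raised above.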
| ### Prices | |
| - New Relic - £400 (awaiting offer) | |
| - Elasticsearch - £300 ($250 prod, $58 staging) | |
| - Set up read / write system for MySQL - £750-1500 ($950 per instance) | |
| - 3 more replicas - £580 ($750) | |
| ### Development estimates | |
| - Set up New Relic - 10h | |
| - Set up Elasticsearch - 20h | |
| - Build Queue / command monitoring tool - 30h | |
| - MySQL optimize - 30h | |
| - Improve clean up scripts - 15h | |
| - Set up read / write system for MySQL - 75h |
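For planning purposes, the estimates above add up as follows. The hours are copied from the list; the dictionary and variable names are only for illustration.

```python
# Development estimates in hours, copied from the list above.
ESTIMATES_H = {
    "Set up New Relic": 10,
    "Set up Elasticsearch": 20,
    "Build queue / command monitoring tool": 30,
    "MySQL optimize": 30,
    "Improve clean up scripts": 15,
    "Set up read / write system for MySQL": 75,
}

total_h = sum(ESTIMATES_H.values())
print(f"Total: {total_h}h")  # Total: 180h

# Share of the total per task, largest first.
for task, hours in sorted(ESTIMATES_H.items(), key=lambda item: item[1], reverse=True):
    print(f"{task}: {hours}h ({hours / total_h:.0%})")
```

The read / write setup is the largest single item, at a bit over 40% of the total.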