@Casperhr
Created February 18, 2018 20:17
Scaling plan for Riide Backend

Last updated 16 Feb 2018

Scaling bottlenecks

  • MySQL is growing rapidly: 52 GB as of Feb 2018, growing 2-3 GB a week at the moment. This is not an urgent issue, but we need to handle it before the database grows above 600 GB. The earlier we solve it, the faster the migration will be, since there is less data to move.
    • It is possible to just buy a bigger server, but restoring backups or scaling horizontally will eventually take days.
  • MySQL runs on a single RDS server, which handles all traffic. Horizontal scaling should be looked into.
  • Queues: jobs are too big or take too long.
  • user_update jobs have a 1:N issue: we are synchronizing all users with each company and pulling Icabbi bookings from each company.
    • Throwing more server power at this could solve it.
  • The fallback system loops over every active booking every 5 min (configurable), which is the same 1:N issue.
    • Throwing more server power at this could solve it.
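
As a rough sanity check on the MySQL timeline, the figures above (52 GB today, 2-3 GB a week, 600 GB ceiling) can be turned into a number of weeks. This is only a sketch: it assumes growth stays linear, which is optimistic if new companies keep being added, and the function name is made up for illustration.

```python
# Figures taken from the plan: 52 GB in Feb 2018, 2-3 GB growth per week,
# and 600 GB as the size we want to stay under.
# ASSUMPTION: growth is linear; real growth may accelerate.
CURRENT_GB = 52
LIMIT_GB = 600


def weeks_until_limit(current_gb: float, limit_gb: float, growth_gb_per_week: float) -> float:
    """Return how many weeks remain until the database reaches limit_gb."""
    if growth_gb_per_week <= 0:
        raise ValueError("growth_gb_per_week must be positive")
    return max(0.0, (limit_gb - current_gb) / growth_gb_per_week)


for rate in (2, 3):
    weeks = weeks_until_limit(CURRENT_GB, LIMIT_GB, rate)
    print(f"{rate} GB/week: {weeks:.0f} weeks (about {weeks / 52:.1f} years)")
```

At the current rate this gives roughly 3.5 to 5.3 years of headroom, which supports treating the size itself as non-urgent while the single-server and job-duration issues are handled first.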

Other issues

  • No application monitoring: we have to see the issues and replicate them before we can work on a fix (unless it is a crash, which Bugsnag reports).
  • Peak periods are always at the weekends, when we are not working.
  • The Icabbi integration is no longer just a provider; we have coupled the two systems very tightly.
  • New companies are added and set up with errors, and the resulting alerts are not getting resolved.
  • There is an overlap between webhooks and our fallback system, which is the cause of several bugs:
    • Webhook delay is not always working
    • Double charges (some cases)
    • Drivers getting payment success and then failed
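
One possible way to reason about the double charges is that the webhook path and the fallback path can both decide to charge the same booking. A common guard is to make the charge idempotent per booking, so whichever path arrives second becomes a no-op. The sketch below is hypothetical: `ChargeLedger`, `charge_once` and the booking ID are invented names, and the in-memory dict stands in for what would need to be a unique constraint in the database to be safe across workers.

```python
class ChargeLedger:
    """In-memory stand-in for a table with a unique key on booking_id.

    ASSUMPTION: one charge per booking. In production the check-and-insert
    must be atomic (e.g. a unique index), which a plain dict is not.
    """

    def __init__(self) -> None:
        self._charged: dict[str, tuple[int, str]] = {}

    def charge_once(self, booking_id: str, amount_pence: int, source: str) -> bool:
        """Record a charge for the booking; return True only if this call charged."""
        if booking_id in self._charged:
            # Another path (webhook or fallback) already charged this booking.
            return False
        self._charged[booking_id] = (amount_pence, source)
        return True


ledger = ChargeLedger()
print(ledger.charge_once("booking-1", 1250, "webhook"))   # True
print(ledger.charge_once("booking-1", 1250, "fallback"))  # False
```

The same key could also decide which payment status is sent to the driver, so that a late fallback run cannot overwrite a success with a failure.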

Suggestions

  • Install New Relic APM: the best way to monitor how the app is used and where to spend time on optimization (API calls).
  • Set up Elasticsearch.
  • Move all historical / non-mission-critical data (booking updates, payment updates, webhook requests, booking locations, etc.) out of MySQL and into Elasticsearch.
  • Build / set up a queue / command monitoring tool. We need an overview of what is running when, and whether it finishes in time. Afterwards, optimize.
  • Optimize MySQL
    • Look into the booking object, to see whether e.g. locations / directions can be removed or moved to either Elasticsearch or files
    • Look into the booking indexes (2.1 GB)
    • ORDER BY without index: a high amount of row sorting is not a problem in itself, but queries that sort a lot should use indexed columns in the ORDER BY clause, as this results in much faster sorting.
    • Joins without index: these joins are doing full table scans. Adding indexes on the columns used in the join conditions will greatly speed up table joins.
    • Optimize tables (function)
    • Run through the config update suggestions
  • Improve the clean-up scripts
  • Set up a read / write system for MySQL, for horizontal scaling
    • Consider Aurora DB (Amazon's new MySQL; 5.7 support as of 15 Feb)
  • Improve the dashboard
  • Improve the alert system
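
To make the queue / command monitoring idea more concrete, here is a minimal sketch of what such a tool might record: when a job started, how long it took, whether it succeeded, and whether it exceeded its time budget. Everything here is an assumption for illustration (the names `monitored`, `JobRun`, `RUNS` and the job name `fallback_sweep`); the backend's actual language and queue system are not shown in this plan, and a real tool would persist runs rather than keep them in a list.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class JobRun:
    name: str
    started_at: float   # wall-clock start (seconds since epoch)
    duration_s: float   # measured run time in seconds
    budget_s: float     # time the job is allowed to take
    ok: bool            # False if the job raised an exception

    @property
    def overran(self) -> bool:
        return self.duration_s > self.budget_s


# ASSUMPTION: an in-memory list stands in for a database table or metrics store.
RUNS: List[JobRun] = []


@contextmanager
def monitored(name: str, budget_s: float) -> Iterator[None]:
    """Record one run of a job, including failed runs."""
    started_at = time.time()
    t0 = time.monotonic()
    ok = False
    try:
        yield
        ok = True
    finally:
        RUNS.append(JobRun(name, started_at, time.monotonic() - t0, budget_s, ok))


# The fallback system runs every 5 minutes, so 300 s is a natural budget for it.
with monitored("fallback_sweep", budget_s=300):
    pass  # the job body would go here

for run in RUNS:
    status = "OVERRAN" if run.overran else "in time"
    print(f"{run.name}: {run.duration_s:.3f}s, {status}, ok={run.ok}")
```

Recording failed and overrunning jobs in the same place gives the overview of "what is running when" before any optimization work starts.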

Prices

  • New Relic - £400 (awaiting offer)
  • Elastic - £300 ($250 prod, $58 staging)
  • Set up read / write system for MySQL - £750-1500 ($950 per instance)
  • 3 more replicas - £580 ($750)

Development estimates

  • Set up New Relic - 10h
  • Set up Elastic - 20h
  • Build queue / command monitoring tool - 30h
  • MySQL optimize - 30h
  • Improve clean-up scripts - 15h
  • Set up read / write system for MySQL - 75h
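
For planning purposes, the estimates above can be summed and ranked. This is plain arithmetic on the listed hours; the dictionary name is just for illustration.

```python
# Hours copied from the estimates list above.
ESTIMATES_H = {
    "Set up New Relic": 10,
    "Set up Elastic": 20,
    "Build queue / command monitoring tool": 30,
    "MySQL optimize": 30,
    "Improve clean-up scripts": 15,
    "Set up read / write system for MySQL": 75,
}

total_h = sum(ESTIMATES_H.values())
print(f"Total: {total_h}h")  # Total: 180h

# Largest items first, with their share of the total.
for task, hours in sorted(ESTIMATES_H.items(), key=lambda item: -item[1]):
    print(f"{hours:>3}h  {hours / total_h:>4.0%}  {task}")
```

The read / write setup alone accounts for 75 of the 180 hours, so it is the item where the estimate matters most.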