

@rhazberries
Last active July 23, 2020 00:02
API Endpoints

Current Issues

  1. Data consistency between backing nodes is not guaranteed.
  2. Health checks are bypassed, so nodes are not automatically removed from the target group when they are down.
  3. Inconsistent errors reported by users when querying every second.
  4. Block explorer backend is reliant on one explorer node.
  5. Endpoints briefly go down during rolling upgrades.

Proposals

Data Consistency

Use an IP Hash load balancing algorithm instead of the current Round Robin algorithm. All requests from the same IP will be sent to the same node, guaranteeing data consistency for that client, but requests would no longer be evenly split among the nodes. The current ELB & Target Group setup does not support IP Hash load balancing; AWS only supports this option for its Network Load Balancer. AWS also has a "sticky" option to redirect requests from the same IP, which requires the clients to support cookies. This problem should also be reduced by fast state syncing.
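
As a rough illustration of the trade-off (not a description of the actual balancer), here is a minimal Python sketch of IP Hash routing; the node addresses are placeholders:

```python
import hashlib

# Hypothetical backing nodes behind the endpoint (placeholder addresses).
NODES = ["10.0.0.1:9500", "10.0.0.2:9500"]

def pick_node(client_ip: str) -> str:
    """Deterministically map a client IP to one backing node.

    The same IP always lands on the same node, so repeated queries see a
    consistent view of that node's state, but traffic is only as even as
    the distribution of client IPs.
    """
    digest = hashlib.sha256(client_ip.encode()).digest()
    return NODES[int.from_bytes(digest, "big") % len(NODES)]

print(pick_node("203.0.113.7"))  # always returns the same node for this IP
```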

Health Checks

We should definitely be using some sort of health check to automatically remove unhealthy instances and keep endpoint uptime. Currently, if one of the explorer nodes goes offline, ~50% of the endpoint requests will fail. AWS port checking relies on the status code and latency of a port ping and is incompatible with the RPC server on port 9500. AWS also supports checking a specific route, which could be added to the Harmony binary to indicate node liveness, but there have been cases where the RPC server does not start correctly even though the node is running consensus properly. Alternatively, we could implement our own load balancing with health checks, using the hmy_syncing RPC to check node sync status and RPC server status.
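
A rough sketch of what such a custom health check could look like in Python, using the hmy_syncing RPC and port 9500 from above; the exact response shape of hmy_syncing may vary by node version, so treat it as an assumption:

```python
import requests

def node_is_healthy(host: str, port: int = 9500, timeout: float = 2.0) -> bool:
    """Return True if the node's RPC server answers and reports it is not syncing.

    Any well-formed reply proves the RPC server on port 9500 is up; the
    hmy_syncing result is then used to judge sync status.
    """
    payload = {"jsonrpc": "2.0", "method": "hmy_syncing", "params": [], "id": 1}
    try:
        resp = requests.post(f"http://{host}:{port}", json=payload, timeout=timeout)
        resp.raise_for_status()
        result = resp.json().get("result")
    except requests.RequestException:
        return False  # RPC server down or unreachable -> remove from rotation
    # As with eth_syncing, a result of False (or an empty object) is taken to
    # mean "fully synced"; anything else means the node is still catching up.
    return result is False or result == {}

# A custom balancer could poll each backing node with node_is_healthy(...)
# and take any node that returns False out of rotation.
```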

Inconsistent Errors

Sesameseed reported sporadically getting the error below when trying to connect to the shard 0 endpoint. Soph confirmed that he also saw this issue with the scripts that the P-Ops were using to run their nodes. The issue was not solved, but they worked around it with Soph's suggestion to load balance across the endpoint & their own internal nodes.

request to https://api.s0.t.hmny.io/ failed, reason: Client network socket disconnected before secure TLS connection was established

This needs further investigation, but I was unable to reproduce it. I suspect some sort of rate limiter placed by AWS, since we did not have a rate limiter on the endpoint at the time of Sesameseed's error report.

Block Explorer

With more utilization of the endpoints by validators & developers, I suggest moving the block explorer backend to a dedicated set of explorer nodes.

Rolling Upgrade

Properly remove and re-add the instances to the target group when performing rolling upgrades. Alternatively, this can be solved by using health checks and staggering the explorer node upgrades within the rolling upgrade process.
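
A hedged sketch of the deregister/upgrade/re-register flow, assuming an Application Load Balancer target group and boto3; the target group ARN, instance IDs, and the upgrade step itself are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/placeholder"  # placeholder
EXPLORER_INSTANCES = ["i-0123456789abcdef0", "i-0fedcba9876543210"]            # placeholders

def upgrade_node(instance_id: str) -> None:
    """Placeholder for the actual binary upgrade step on one explorer node."""
    ...

# Upgrade one node at a time so the endpoint never loses all of its targets.
for instance_id in EXPLORER_INSTANCES:
    elbv2.deregister_targets(
        TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": instance_id}]
    )
    elbv2.get_waiter("target_deregistered").wait(
        TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": instance_id}]
    )
    upgrade_node(instance_id)
    elbv2.register_targets(
        TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": instance_id}]
    )
    elbv2.get_waiter("target_in_service").wait(
        TargetGroupArn=TARGET_GROUP_ARN, Targets=[{"Id": instance_id}]
    )
```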

Alternative Load Balancers

A simple IP Hash load balancer can be implemented using NGINX.
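
For reference, a minimal sketch of such a configuration; the backend addresses are placeholders and port 9500 is the RPC port mentioned above:

```nginx
# Minimal IP Hash load balancer sketch; backend addresses are placeholders.
upstream harmony_rpc {
    ip_hash;                  # all requests from one client IP hit the same node
    server 10.0.0.1:9500;
    server 10.0.0.2:9500;
}

server {
    listen 80;
    location / {
        proxy_pass http://harmony_rpc;
    }
}
```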

Increase Number of Nodes

With increased activity, I think we should increase the number of nodes behind the endpoint to 3, or at the very least increase the number for shard 0. Even though we have added 2 more endpoints for shard 0, we should still increase it for those who do not want to load balance themselves. With only 2 nodes, if maintenance needs to be performed on one of the explorer nodes, the remaining node is a single point of failure for the endpoint.

Additional Features/Ideas

  1. Gather data on RPC usage using a reverse proxy on top of the load balancer.
  2. Monitor uptime on individual nodes using health checks & request latency checks
  3. Custom load balancing: route RPC calls that require archival mode to an archival node & run the rest of the explorer nodes as non-archival to save storage costs (a rough sketch follows this list).
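
For idea 3, a rough sketch of method-based routing in Python; the set of methods that actually require archival mode and the node addresses are placeholders that would need to be confirmed against the node:

```python
# Placeholder: the real set of RPC methods that need full history must be confirmed.
ARCHIVAL_ONLY_METHODS = {"hmy_getBalanceByBlockNumber"}  # assumption, for illustration

ARCHIVAL_NODE = "10.0.0.10:9500"                          # placeholder address
NON_ARCHIVAL_NODES = ["10.0.0.1:9500", "10.0.0.2:9500"]   # placeholder addresses

def route(rpc_request: dict) -> str:
    """Send history-dependent calls to the archival node, everything else in rotation."""
    if rpc_request.get("method") in ARCHIVAL_ONLY_METHODS:
        return ARCHIVAL_NODE
    # Naive rotation for illustration; a real balancer would track state properly.
    NON_ARCHIVAL_NODES.append(NON_ARCHIVAL_NODES.pop(0))
    return NON_ARCHIVAL_NODES[-1]
```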

Implementation Plan

  1. Health check
  • Implement /node-sync to return status code 200 if the node is in sync. This route will also return a boolean, which Trust Wallet & others can use on their own side to check whether the explorer nodes are in sync (a consumer-side sketch follows this step's results). This health check will not catch the case where the RPC server does not start properly, since the explorer service is served on a different port. PR
  • Enable the check on the node & monitor the uptime over the course of a day to check stability (through UptimeRobot without PagerDuty alerts). If stability is concerning, consider increasing the number of nodes behind the load balancer to 3 on each shard.

RESULTS:

1 explorer node per shard, 1 minute intervals over ~20 hours

  • Shard 0: 95% (5 downtimes, longest: 55 minutes)
  • Shard 1: 100%
  • Shard 2: 99.9% (1 downtime, 1 minute)
  • Shard 3: 100%
  • Add another explorer node for Shard 0.
  • Enable the health check on the load balancer if stability looks good.
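
For reference, a sketch of how Trust Wallet & others could consume the /node-sync route from step 1; the exact response body beyond "status 200 plus a boolean" is an assumption:

```python
import requests

def explorer_in_sync(host: str, port: int, timeout: float = 2.0) -> bool:
    """Check the /node-sync route: HTTP 200 plus a truthy boolean body means in sync.

    The route returns status 200 when the node is in sync and a boolean in the
    body; the exact body format here is an assumption for illustration.
    """
    try:
        resp = requests.get(f"http://{host}:{port}/node-sync", timeout=timeout)
    except requests.RequestException:
        return False
    return resp.status_code == 200 and resp.text.strip().lower() == "true"
```
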
  2. Data consistency
  • Test sticky sessions using Testnet Shard 2 endpoint.

RESULTS: Sticky sessions only apply to continuous connections to the endpoint; they do not work for single one-off requests made with curl or pyhmy.
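
This matches stickiness being cookie-based: a client only stays pinned to a node if it returns the load balancer's cookie on later requests, which one-off curl or pyhmy calls do not. A small sketch of the difference, using the shard 0 endpoint URL from above and an illustrative payload:

```python
import requests

URL = "https://api.s0.t.hmny.io/"
PAYLOAD = {"jsonrpc": "2.0", "method": "hmy_syncing", "params": [], "id": 1}

# One-off requests: no cookie jar, so each call may land on a different node.
requests.post(URL, json=PAYLOAD)
requests.post(URL, json=PAYLOAD)

# A session keeps the load balancer's stickiness cookie between calls,
# so subsequent requests are routed back to the same node.
with requests.Session() as s:
    s.post(URL, json=PAYLOAD)
    s.post(URL, json=PAYLOAD)
```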

  • See if health checks are good enough, else need to pursue potentially other load balancer options.
  • Try swapping the Application Load Balancer to a Network Load Balancer.
  3. DB Backups (Reduce downtime due to syncing)
  • Increase explorer node storage space to 600GB, because the snapshot script makes a local copy of the DB to upload to S3 to reduce node downtime (200GB x 2 for the Shard X DB + 200GB for the Beacon Chain DB; the Explorer DB size is <1GB).
  • Automate daily archival DB snapshots using the existing snapshot script.
  • Automate daily explorer DB snapshots (may need to change the snapshot script to support the explorer DB).
@LeoHChen

Some comments.

  1. we should try not to implement our own load balancer; exploring existing options like AWS/GCP would be better.
  2. implementing a health check and automatically taking the un-sync'ed node offline would be helpful.
  3. we can add one more explorer node behind shard0 for sure.

@LeoHChen

For rolling upgrade,

  1. we can do the upgrade one node at a time, to make sure the endpoints are not all offline at the same time.
  2. we still need to implement the sync'ed API, harmony-one/harmony#3194

@rlan35

rlan35 commented Jul 11, 2020

IP hash load balancing is definitely good. We don't have much load anyway, or any peaks, so IP hash already gives us good enough random balancing.

Definitely we need to add /health_check to the node that reports true when && <node is fully sync'ed>

Agree to add more nodes behind the balancer as we get more usage.
