- Data consistency between backing nodes is not guaranteed.
- Health checks are bypassed, so nodes are not automatically removed from the target group when they are down.
- Users report intermittent errors when querying the endpoint every second.
- The block explorer backend relies on a single explorer node.
- Endpoints briefly go down during rolling upgrades.
Use an IP Hash load balancing algorithm instead of the current Round Robin algorithm. All requests from the same IP would be sent to the same node, guaranteeing data consistency, but requests would no longer be evenly split among the nodes. The current ELB & Target Group setup does not support IP Hash load balancing; AWS only offers this option on their Network Load Balancers. AWS also has a "sticky" session option to route requests from the same client to the same target, but it requires the targets to support cookies. The inconsistency problem should also be reduced by fast state syncing.
We should definitely be using some sort of health check to automatically remove unhealthy instances and maintain endpoint uptime. Currently, if one of the explorer nodes goes offline, ~50% of endpoint requests will fail.
AWS port health checks rely on the status code and latency of a port ping and are incompatible with the RPC server on port 9500.
AWS also supports checking a specific route, which could be added to the Harmony binary to indicate node liveness, but there have been cases where the RPC server does not start correctly even though the node is running consensus properly.
Alternatively, implement our own load balancing with health checks, using the hmy_syncing RPC to check both node sync status and RPC server status.
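A minimal sketch of such a check in Go. It assumes hmy_syncing mirrors eth_syncing semantics (the result is a literal `false` once the node is in sync); node addresses are placeholders. A failed POST doubles as a check that the RPC server itself is up.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// checkNode POSTs a hmy_syncing request to the node's RPC port and reports
// whether the RPC server answered and whether the node claims to be in sync.
func checkNode(rpcURL string) (healthy bool, err error) {
	payload := []byte(`{"jsonrpc":"2.0","method":"hmy_syncing","params":[],"id":1}`)
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(rpcURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return false, err // RPC server is down or unreachable
	}
	defer resp.Body.Close()

	var reply struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		return false, err
	}
	// Assumption: a literal false result means the node is fully synced.
	return bytes.Equal(bytes.TrimSpace(reply.Result), []byte("false")), nil
}

func main() {
	// Placeholder backing nodes; 9500 is the RPC port mentioned above.
	nodes := []string{"http://10.0.0.1:9500", "http://10.0.0.2:9500"}
	for _, n := range nodes {
		ok, err := checkNode(n)
		fmt.Printf("%s healthy=%v err=%v\n", n, ok, err)
	}
}
```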
Sesameseed reported sporadically getting the following error when trying to connect to the shard 0 endpoint. Soph confirmed that he also saw this issue with the scripts the P-Ops were using to run their nodes. The issue was not solved, but they worked around it with Soph's suggestion to load balance across the endpoint and their own internal nodes.
```
request to https://api.s0.t.hmny.io/ failed, reason: Client network socket disconnected before secure TLS connection was established
```
This needs further investigation, but I was unable to reproduce it. I suspect some sort of rate limiter imposed by AWS, since we did not have a rate limiter on the endpoint at the time of Sesameseed's error report.
With more utilization of the endpoints by validators & developers, I suggest moving the block explorer backend to a dedicated set of explorer nodes.
Properly remove and re-add the instances to the target group when performing rolling upgrades. Alternatively, this can be solved by using health checks and staggering the explorer node upgrades within the rolling upgrade process.
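A rough sketch of the drain-and-restore step using the AWS SDK for Go (aws-sdk-go); the target group ARN, instance ID, and upgrade callback are placeholders, not our actual upgrade tooling:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

// drainAndRestore removes one explorer node from the target group, waits for
// draining to finish, runs the upgrade, then re-registers the node and waits
// for it to pass health checks before the caller moves to the next node.
func drainAndRestore(svc *elbv2.ELBV2, tgARN, instanceID string, upgrade func() error) error {
	target := []*elbv2.TargetDescription{{Id: aws.String(instanceID)}}

	if _, err := svc.DeregisterTargets(&elbv2.DeregisterTargetsInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	}); err != nil {
		return err
	}
	// Block until the ELB has stopped sending this node traffic.
	if err := svc.WaitUntilTargetDeregistered(&elbv2.DescribeTargetHealthInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	}); err != nil {
		return err
	}

	if err := upgrade(); err != nil { // the actual node upgrade happens here
		return err
	}

	if _, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	}); err != nil {
		return err
	}
	return svc.WaitUntilTargetInService(&elbv2.DescribeTargetHealthInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	})
}

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))
	// Placeholder ARN and instance ID.
	err := drainAndRestore(svc, "arn:aws:elasticloadbalancing:...", "i-0123456789abcdef0",
		func() error { return nil /* run the rolling-upgrade step for this node */ })
	if err != nil {
		log.Fatal(err)
	}
}
```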
A simple IP Hash load balancer can be implemented using NGINX.
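A minimal config sketch, assuming the backing nodes serve RPC on port 9500 (addresses are placeholders). The max_fails/fail_timeout parameters give NGINX's built-in passive health checks; active health checks require NGINX Plus or an external checker such as the one sketched above.

```nginx
# ip_hash keys on the client IP so the same caller always lands on the
# same backing node, addressing the data consistency issue.
upstream explorer_nodes {
    ip_hash;
    # Passive health checks: a node that fails 3 times is skipped for 30s.
    server 10.0.0.1:9500 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:9500 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;  # TLS termination omitted from this sketch
    location / {
        proxy_pass http://explorer_nodes;
    }
}
```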
With increased activity, I think we should increase the number of nodes behind the endpoint to 3, or at the very least increase the count for shard 0. Even though we have added 2 more endpoints for shard 0, we should still increase it for those who do not want to load balance themselves. With only 2 nodes, taking one explorer node down for maintenance leaves the endpoint with a single point of failure.
- Gather data on RPC usage using a reverse proxy on top of the load balancer.
- Monitor uptime on individual nodes using health checks & request latency checks.
- Custom load balancing: route RPC calls that require archival mode to an archival node, and run the rest of the explorer nodes as non-archival to save storage costs (see the sketch after this list).
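A rough sketch of what that routing could look like as a small Go reverse proxy that peeks at the JSON-RPC method field. The archival method list and node addresses are placeholder assumptions (the real set would come from auditing which methods fail on a pruned node), and batched (array) requests would need extra handling:

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical set of RPC methods that need full history.
var archivalMethods = map[string]bool{
	"hmy_getBalanceByBlockNumber": true,
}

func newProxy(target string) *httputil.ReverseProxy {
	u, _ := url.Parse(target) // targets below are well-formed placeholders
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	archival := newProxy("http://10.0.0.1:9500") // archival node (placeholder)
	pruned := newProxy("http://10.0.0.2:9500")   // non-archival pool (placeholder)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		r.Body = io.NopCloser(bytes.NewReader(body)) // restore body for the proxy

		var req struct {
			Method string `json:"method"`
		}
		_ = json.Unmarshal(body, &req) // batch requests fall through to pruned

		if archivalMethods[req.Method] {
			archival.ServeHTTP(w, r)
		} else {
			pruned.ServeHTTP(w, r)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```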