- Data consistency between backing nodes is not guaranteed.
- Health checks are bypassed, so nodes are not automatically removed from the target group when they are down.
- Users report intermittent errors when querying the endpoint every second.
- The block explorer backend relies on a single explorer node.
- Endpoints briefly go down during rolling upgrades.
Use an IP Hash load balancing algorithm instead of the current Round Robin algorithm. All requests from the same IP would be sent to the same node, guaranteeing data consistency, but requests would no longer be evenly split among the nodes. The current ELB & Target Group setup does not support IP Hash load balancing; AWS only offers this option on their Network Load Balancers. AWS also has a "sticky" session option to route requests from the same client to the same target, but it requires the targets to support cookies. The inconsistency problem should also be reduced by fast state syncing.
We should definitely be using some sort of health check to automatically remove unhealthy instances and maintain endpoint uptime. Currently, if one of the explorer nodes goes offline, ~50% of endpoint requests will fail.
AWS port health checks rely on the status code and latency of a port ping and are incompatible with the RPC server on port 9500.
AWS also supports checking a specific route, which could be added to the Harmony binary to indicate node liveness, but there have been cases where the RPC server does not start correctly even though the node is running consensus properly.
Alternatively, implement our own load balancing with health checks, using the hmy_syncing RPC to check both node sync status and RPC server status.
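A minimal sketch of such a check in Go. It assumes hmy_syncing mirrors eth_syncing semantics (the result is a literal `false` once the node is in sync); node addresses are placeholders. A failed POST doubles as a check that the RPC server itself is up.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// checkNode POSTs a hmy_syncing request to the node's RPC port and reports
// whether the RPC server answered and whether the node claims to be in sync.
func checkNode(rpcURL string) (healthy bool, err error) {
	payload := []byte(`{"jsonrpc":"2.0","method":"hmy_syncing","params":[],"id":1}`)
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(rpcURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return false, err // RPC server is down or unreachable
	}
	defer resp.Body.Close()

	var reply struct {
		Result json.RawMessage `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		return false, err
	}
	// Assumption: a literal false result means the node is fully synced.
	return bytes.Equal(bytes.TrimSpace(reply.Result), []byte("false")), nil
}

func main() {
	// Placeholder backing nodes; 9500 is the RPC port mentioned above.
	nodes := []string{"http://10.0.0.1:9500", "http://10.0.0.2:9500"}
	for _, n := range nodes {
		ok, err := checkNode(n)
		fmt.Printf("%s healthy=%v err=%v\n", n, ok, err)
	}
}
```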
Sesameseed reported sporadically getting the following error when trying to connect to the shard 0 endpoint. Soph confirmed that he also saw this issue with the scripts the P-Ops were using to run their nodes. The issue was not solved, but they worked around it with Soph's suggestion to load balance across the endpoint and their own internal nodes.
```
request to https://api.s0.t.hmny.io/ failed, reason: Client network socket disconnected before secure TLS connection was established
```
This needs further investigation, but I was unable to reproduce it. I suspect some sort of rate limiter imposed by AWS, since we did not have a rate limiter on the endpoint at the time of Sesameseed's error report.
With more utilization of the endpoints by validators & developers, I suggest moving the block explorer backend to a dedicated set of explorer nodes.
Properly remove and re-add the instances to the target group when performing rolling upgrades. Alternatively, this can be solved by using health checks and staggering the explorer node upgrades within the rolling upgrade process.
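A rough sketch of the drain-and-restore step using the AWS SDK for Go (aws-sdk-go); the target group ARN, instance ID, and upgrade callback are placeholders, not our actual upgrade tooling:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/elbv2"
)

// drainAndRestore removes one explorer node from the target group, waits for
// draining to finish, runs the upgrade, then re-registers the node and waits
// for it to pass health checks before the caller moves to the next node.
func drainAndRestore(svc *elbv2.ELBV2, tgARN, instanceID string, upgrade func() error) error {
	target := []*elbv2.TargetDescription{{Id: aws.String(instanceID)}}

	if _, err := svc.DeregisterTargets(&elbv2.DeregisterTargetsInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	}); err != nil {
		return err
	}
	// Block until the ELB has stopped sending this node traffic.
	if err := svc.WaitUntilTargetDeregistered(&elbv2.DescribeTargetHealthInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	}); err != nil {
		return err
	}

	if err := upgrade(); err != nil { // the actual node upgrade happens here
		return err
	}

	if _, err := svc.RegisterTargets(&elbv2.RegisterTargetsInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	}); err != nil {
		return err
	}
	return svc.WaitUntilTargetInService(&elbv2.DescribeTargetHealthInput{
		TargetGroupArn: aws.String(tgARN),
		Targets:        target,
	})
}

func main() {
	svc := elbv2.New(session.Must(session.NewSession()))
	// Placeholder ARN and instance ID.
	err := drainAndRestore(svc, "arn:aws:elasticloadbalancing:...", "i-0123456789abcdef0",
		func() error { return nil /* run the rolling-upgrade step for this node */ })
	if err != nil {
		log.Fatal(err)
	}
}
```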
A simple IP Hash load balancer can be implemented using NGINX.
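A minimal config sketch, assuming the backing nodes serve RPC on port 9500 (addresses are placeholders). The max_fails/fail_timeout parameters give NGINX's built-in passive health checks; active health checks require NGINX Plus or an external checker such as the one sketched above.

```nginx
# ip_hash keys on the client IP so the same caller always lands on the
# same backing node, addressing the data consistency issue.
upstream explorer_nodes {
    ip_hash;
    # Passive health checks: a node that fails 3 times is skipped for 30s.
    server 10.0.0.1:9500 max_fails=3 fail_timeout=30s;
    server 10.0.0.2:9500 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;  # TLS termination omitted from this sketch
    location / {
        proxy_pass http://explorer_nodes;
    }
}
```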
With increased activity, I think we should increase the number of nodes behind the endpoint to 3, or at the very least increase the count for shard 0. Even though we have added 2 more endpoints for shard 0, we should still increase it for those who do not want to load balance themselves. With only 2 nodes, taking one explorer node down for maintenance leaves the endpoint with a single point of failure.
- Gather data on RPC usage using a reverse proxy on top of the load balancer.
- Monitor uptime on individual nodes using health checks & request latency checks.
- Custom load balancing: route RPC calls that require archival mode to an archival node, and run the rest of the explorer nodes as non-archival to save storage costs (see the sketch after this list).
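A rough sketch of what that routing could look like as a small Go reverse proxy that peeks at the JSON-RPC method field. The archival method list and node addresses are placeholder assumptions (the real set would come from auditing which methods fail on a pruned node), and batched (array) requests would need extra handling:

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// Hypothetical set of RPC methods that need full history.
var archivalMethods = map[string]bool{
	"hmy_getBalanceByBlockNumber": true,
}

func newProxy(target string) *httputil.ReverseProxy {
	u, _ := url.Parse(target) // targets below are well-formed placeholders
	return httputil.NewSingleHostReverseProxy(u)
}

func main() {
	archival := newProxy("http://10.0.0.1:9500") // archival node (placeholder)
	pruned := newProxy("http://10.0.0.2:9500")   // non-archival pool (placeholder)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		r.Body = io.NopCloser(bytes.NewReader(body)) // restore body for the proxy

		var req struct {
			Method string `json:"method"`
		}
		_ = json.Unmarshal(body, &req) // batch requests fall through to pruned

		if archivalMethods[req.Method] {
			archival.ServeHTTP(w, r)
		} else {
			pruned.ServeHTTP(w, r)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```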