Private Load Balanced Endpoint Backbone

shinyfoil · September 2, 2020, 3:52am

NOTE: This proposal still has items in brackets that are awaiting community feedback.

1) Title

2) Project Category

Infrastructure: for supporting the underlying code base of the blockchain - Infrastructure supporting tools, bug patches, node maintenance tools, etc.

3) Project Description

As was outlined in our forum post, ICON’s current centralized endpoint infrastructure presents a critical single point of failure for the entire network.

Having a centralized API endpoint that numerous parties rely on (e.g. ctz.solidwallet.io) is a mission-critical risk. As seen, if this endpoint goes down, it results in a cascading set of problems that impact the whole network. This grant aims to mitigate these problems.

As we see it now (awaiting further discussions with the Foundation), the core of the problem is the central solidwallet endpoint, which all the nodes on the network sync off of by default. If it goes down, all the sub-preps go down. If enough main preps go down after that, the network will stop functioning, as they too sync off of solidwallet by default. Even if a main prep maintains its own citizen node, the main preps would need to coordinate so that they sync off each other, which in and of itself creates additional problems.

What we see is needed is a private endpoint that is syncing off several voluntary main preps and serves as a common sync point for the rest of the critical infrastructure. If we focus on making that common sync point as robust and fault tolerant as possible, all the subsequent API infrastructure would be able to sync off it. It is important that this sync point is private such that attackers are unable to access it. We would also be able to potentially make the REST API port 9000 private for main preps as the rest of the network will no longer need access to that port.

After hardening the sync point, further decentralization of the API interface can proceed, particularly in cooperation with Pokt network, who appear to have a solid solution for converting the sub-preps into load balanced API nodes that are further compensated for operating a node on the network. These nodes could then take over the load of solidwallet as we prove out their stability on the network. Regardless of whether the network fully embraces Pokt as an API provider, having a private network would be critical in making the network more stable.

This type of architecture is not uncommon in other top performing ecosystems. Cosmos, for instance, employs a private sentry layer where nodes are connected to validators over a secure VPN connection. We have implemented such a setup with WireGuard for other ecosystems, and this grant lays the groundwork for a follow-on grant to implement it for ICON. While Cosmos’ networking protocols are different than ICON’s, the protection layer serves the same purpose in practice.

All development of the infrastructure will leverage prior work done for the Polkadot network, where we built a set of load balanced endpoints on each of the major cloud providers. Auto-scaling behavior is a critical consideration: baseline operations should be relatively lean, but the endpoints need to be able to scale to meet whatever demand arises. Crucially, scaling needs to happen in a reasonable amount of time to be effective. Sync time is currently roughly 2 hours, which can be reduced to under 15 minutes if handled by a custom content delivery network that is updated on a regular basis by a source of truth node. We have already implemented this methodology in Polkadot and were able to bring the scaling time down to 5 minutes, which makes scaling simple to manage. We intend to implement this setup on our own and could operate it in the long term so that all nodes can sync and recover faster.

While the proposal is aimed at mitigating the specific threat vectors outlined above, building it with automated deployment tooling will let us offer the community a one-click deployment of citizen nodes for DApps, thereby decreasing the reliance on the public solidwallet endpoint. This type of tooling will come with minimal extra effort when executing this grant. All components will be built with a combination of Terraform, Ansible, and Helm, and fronted with a CLI to walk users through the deployment process.

4) Project Duration

4 Months

Development over 4 months
Operation for 12 months, starting as soon as development is complete

5) Project Milestones

Phase 1 - Infrastructure as code modules - 2 Months

Develop terraform modules for AWS, GCP, and Cloudflare with continuous integration testing
- 2 repos for auto-scaling group modules, scaling policies, and load balancer (AWS + GCP)
  - Terraform + Packer + Ansible
- 1 or 2 repos for agents to update firewalls on service discovery of P-Rep’s node IP via API
  - Exact firewall implementation still under consideration.
- 2 repos for Kubernetes clusters (AWS + GCP)
- 1 repo for cloud-init scripts
- 1 repo for library node and CDN for fast scale and sync capabilities (AWS)
  - Current sync time is roughly 2 hours. We aim to bring sync time down to less than 15 minutes to support effective autoscaling capabilities
- 1 repo for Helm charts with Nginx, cert manager, Prometheus, Grafana, Alertmanager, Consul, and Elasticsearch
  - Alarms on each instance on important metrics
- 1 repo for Cloudflare routing and load balancer

Most of phase 1 is relatively off the shelf except for the L4 firewall that needs to be custom made to be responsive to peer discovery on the network. Because of the mission critical nature of this infrastructure, extra considerations need to be made to harden this infrastructure.
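As a rough illustration of the reconciliation step such a firewall agent might run, here is a minimal Python sketch. It assumes each discovered P-Rep record carries a `p2pEndpoint` field of the form `ip:port`; that field name and the helper names `extract_ips` and `plan_firewall_update` are assumptions for illustration only, and the exact firewall implementation is still under consideration.

```python
import ipaddress


def extract_ips(preps):
    """Collect IP literals from P-Rep records (assumed shape: {"p2pEndpoint": "ip:port"}).

    Endpoints that are DNS names, bracketed IPv6 literals, or malformed are
    skipped in this sketch; a real agent would need to resolve or handle them.
    """
    ips = set()
    for prep in preps:
        endpoint = prep.get("p2pEndpoint", "")
        host = endpoint.rsplit(":", 1)[0]
        try:
            ips.add(str(ipaddress.ip_address(host)))
        except ValueError:
            continue
    return ips


def plan_firewall_update(current_allowlist, discovered_ips):
    """Return (to_add, to_remove) so the allowlist matches the discovered peers."""
    to_add = sorted(discovered_ips - current_allowlist)
    to_remove = sorted(current_allowlist - discovered_ips)
    return to_add, to_remove
```

The agent would then apply `to_add` and `to_remove` to whatever firewall backend is chosen, so a rule change only happens when the peer set actually changes.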

Phase 2 - Testing and optimization - 1 Month

Deploy solution on testnet
- Deploy exact configuration as expected on mainnet
Load testing of endpoints with Locust to determine proper scaling policies
Report findings and recommendations for securely operated node topologies
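To show how load test results could feed into a scaling policy, here is a small Python sketch. The headroom fraction, the minimum node count, and the function name `desired_nodes` are assumptions for illustration; the real numbers would come out of the Locust runs.

```python
import math


def desired_nodes(peak_rps, per_node_rps, headroom=0.3, min_nodes=3):
    """Estimate how many nodes are needed to serve peak_rps.

    per_node_rps is the sustained throughput one node handled in the load
    test; headroom is the fraction of that capacity kept in reserve.
    """
    usable_rps = per_node_rps * (1 - headroom)
    return max(min_nodes, math.ceil(peak_rps / usable_rps))
```

For example, if one node sustains 200 requests per second and the expected peak is 1000, this suggests 8 nodes with 30% headroom, while a quiet period falls back to the minimum of 3.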

Phase 3 - Integration - 1 Month

Integration into current one-click deployment solution to allow anyone to spin up their own load balanced API endpoints within minutes
Contribution to prep_docker to make endpoint default option

Phase 4 - Deployment and Operations - 12 Months

Deploy [1] cluster on both AWS and GCP and perform geo-routing through Cloudflare
Operate clusters with at least [3] nodes on NVMe-backed VMs with [4] cores each and a scalable routing layer

6) Funding Amount Requested

Requested amount falls into two budgetary categories, development and operations. For development we are requesting [X$ USD]. For operations, we are requesting a baseline operations fee of [X$ USD / month] to cover expected AWS and management costs along with at-cost accounting for any additional necessary scaling that will be needed.

7) Official Team Name

Insight

8) Team Contact Information (e.g. e-mail, telegram id, social media)

Email: insighticon.prep@gmail.com

Telegram ID: @robcio

Social Media: twitter.com/icon_insight

GitHub: github.com/insight-icon

9) Public Address (To receive the grant)

----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

hxd278152d68a34baff1492d9abaf652fa02cbd088
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEfeqitHBQR7zWx7uGhofO4QrZt98FAl9JQHsACgkQhofO4QrZ
t98XXhAAlLPoKISf9LJMtCUS6flV1kpNqJb3iPvfGObNZqYpwcwYSeRObszgUz5b
aqOwz1JzDnrW0RrS/S9KGPawkT7byH16iu/c0EpU7vmgs/hSNmXLxbjdKHt/uEqs
pLjQ1YqmnglO7kNHYTD3bSe+uX1kngw9KaVLUzn1jRmVMB3avqf5PrFuYRry+elj
DdBPpg9KyLbmadFU3axtYLC25uA6n0LsbBbBMD8NZHzYzcOzz82t89iF+ws/iKy7
9oLdEKNAXv24PSlrX06dv7+zwyEyKnG+4h5KNz+MnsZlkcoweOCRsU+E71ZSwa5h
6ZjZ5VxWINuB5lvCHT4H1MkDFlt4OP2uTqFNVsxS7MNQQg8OFGvIJHooFomyoNL8
MIV/pAx1aPfYL3i6YN+usdxBCFdoClw69aGFPslusSzutXNvhrDW/j9cUhKZ3Xsh
qph8Ywj+tvKEPZOeLAh3B4+2KEUBcoyS+AfBfiPOo+5vLpo4C0xHmmsnIhVhtuiL
3NfqCfmXw5XXgXDvGF56VJuP455kbbcGfrGIOF6eI54FqDs7CQlEyFy4OqGUcgZD
vS01hDN5xxWWE1o/W2Mnme2cHAQNP6lYWelONuDwxhVFJh6j3Sysi5igMbTOtl05
kXoRdelS6pKOgc23nrH9tjlL9ktAF2kMFHUy/kvHS2bxeh11Oo8=
=5PUl
-----END PGP SIGNATURE-----

robcio · September 2, 2020, 7:29am

Just a note, per the top of the grant, certain items are awaiting community feedback and at the moment the grant is not submitted but open for comment. I think it is important to have this conversation though given how centralized we are around solidwallet and this could be the right way forward.

If the above network topology makes sense as the solution, the main items that would impact the scope are the number of nodes we’ll be running and various feature sets to make the deployment more robust. Right now we are scoping basically all the things we think should be done given the sensitivity of these endpoints but would invite the community to offer any feedback.

Please let us know in the comments what you think about our plan and approach. Happy to offer any more technical details or talk about other approaches.

NorskKiwi · September 3, 2020, 6:37am

Sounds prudent to me. Hope this post gets due attention.

robcio · October 14, 2020, 1:17am

Just bumping this. Was hoping to get a few responses / feedback on approach. Few other teams are looking into this now and I think together we should all come up with a solution that makes sense. The best solution in all likelihood will be one that combines multiple approaches.

To be clear though, the solution I am proposing is largely a network topology suggestion but the solution can be adopted piecemeal. While the focus is on making the network more resilient, logically the solution for making the API infrastructure more robust will fall directly out of this with basically no additional work. I laid out all the steps a big company would likely have on their production checklist but some steps can be skipped / done hastily. For instance,

We don’t need to do rigorous load testing. A single load test will suffice, which is simple to set up.
We don’t need to be multi-cloud right now; just AWS is fine. My only firm requirement is that the cloud has good Kubernetes integration and offers instances with NVMe storage, as IOPS is going to be the biggest bottleneck.
There is little need to make this solution easy to adopt, but that is a feature that can be added. The main focus of this work needs to be on maintaining as many 9’s of uptime as possible (i.e. 99.999…%) to adhere to an SLA, and the deployment and management processes need to reflect that. We have been getting very good at building production grade automated deployments and would be drawing on a lot of best practices when building the tooling.

Please let me know your thoughts or any feedback on the proposal.

Emre · October 14, 2020, 4:51am

I was curious about the status on this since I saw the everstake one. I voted for it though. It’s needed and the aim was clear. I am curious why there is no response yet.

ICON_ADMIN · October 16, 2020, 7:48am

Hi, @shinyfoil we’ve discussed this proposal internally and came up with the below questions. Please take a look and let me know if you have any questions.

The current network sync does not work the way you described.
It works as follows:
- Main P-Reps talk to each other and sync from other Main P-Reps via 7100 (gRPC)
- Sub P-Reps get the list of P-Reps from Solidwallet and find the fastest node nearby and sync
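The peer selection step described above can be sketched roughly as follows in Python. How the node software actually measures "fastest" is not stated here, so the `measure_latency` callback and the name `pick_fastest_peer` are purely hypothetical.

```python
def pick_fastest_peer(peers, measure_latency):
    """Return the reachable peer with the lowest latency, or None if none respond.

    measure_latency(peer) is a caller-supplied probe that returns seconds as a
    float, or None when the peer is unreachable.
    """
    timings = {}
    for peer in peers:
        latency = measure_latency(peer)
        if latency is not None:
            timings[peer] = latency
    if not timings:
        return None
    return min(timings, key=timings.get)
```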

Questions:

Overall architecture
- Request details on how the private endpoint is configured and how the private endpoint and private sentry node communicate.
WireGuard
- VPN can be a single point of failure. Does the WireGuard server communicate directly with the peer?
- How do the private sentry nodes talk to each other with WireGuard? Please describe the flow
- How do the private and public nodes talk to each other with WireGuard? Please describe the flow
- Do you have any plan to dualize the WireGuard server for the HA (High Availability)?
- Who is running the WireGuard server? Can you dockerize it?
Load Balanced Endpoints
- Are Consul and ElasticSearch used for health checks?
- What’s the plan to do health check for the nodes under the load balancer?
- It’s very difficult to do caching using the CloudFlare in terms of endpoint operation. Even if CloudFlare does GeoRouting, actually many physical nodes are required in the proxy method - do you have any alternative solution for that?

robcio · October 20, 2020, 3:14pm

Hi Bong,

Thank you very much for the questions and thoughtful response. I’ll answer the questions inline, but first I want to update you on where our thoughts are on this grant.

This grant was originally conceived before ICON 2.0 was announced. Whatever we propose here should be in line with goloop networking capabilities if they change (more on this later).

Also when originally writing this grant, we thought that nodes were syncing from solidwallet. Now that we see that they are only using it for peer discovery, it would be less critical that they are scalable and more important that they are available. It also seems that a private network is in the near term less important than having failover endpoints for solidwallet, or simply making it more robust. In both cases, though, the basic infrastructure would be largely the same.

Given this, we propose getting started on an abbreviated version of this grant, without the firewall, to build and operate public endpoints. This would at least address the public endpoint operations, and later we can customize it for the private network if needed. We can follow up with a grant to reflect this change and incorporate any other feedback.

Questions:

Overall architecture
- Request details on how the private endpoint is configured and how the private endpoint and private sentry node communicate.

The endpoints will be provisioned with Terraform on AWS, and either configured with ansible or run on kubernetes with helm. The ansible role is done and we also recently open sourced the prep-node as a helm chart. Instances with attached NVMe volumes will be used for the data volume. An L4 load balancer (NLB) will be used and traffic routed based on a customized health check. Prometheus will monitor the nodes and elasticsearch will log it.

Each private sentry syncs from a group of main preps and can establish a connection publicly or, if we aim to make the main-preps private, over wireguard. An endpoint is then configured with an L4 firewall that can be updated via an API based on changes in the network. We’re still evaluating options for setting up this firewall but we’d likely run it on kubernetes.

We want to emphasize that the wireguard implementation is optional at this stage and really just how tendermint-based chains and Polkadot handle sentries. Polkadot is actually deprecating their use this month with a networking layer improvement.

Also if possible, it would be best to contemplate VPN implementations in the context of ICON 2.0. We have an expert in libp2p at Insight (the networking stack used by Eth 2.0 / Parity / IPFS and more) who we’d love to rope into that conversation if you are interested. I’d defer to him for the actual judgement but I would imagine that we should be able to make all the ports private on the main p-reps if we used libp2p. Please let me know if this interests you and we can figure out what the best steps forward would be.

WireGuard

While we have experience with wireguard, there are several appealing alternatives, including algo and nebula, that we could evaluate as well. We are open to any suggestion for this layer, or even to skipping it initially in favor of a simple firewall or, even better, network layer improvements.

VPN can be a single point of failure. Does the WireGuard server communicate directly with the peer?

In this case, the wireguard server is installed on the private sentry and the client on the main prep.

How do the private sentry nodes talk to each other with WireGuard? Please describe the flow

The private sentries within the same data center talk over the private network without a VPN. Connections across data centers are not needed; the VPN is only needed for connections to main preps, for security.

How do the private and public nodes talk to each other with WireGuard? Please describe the flow

Normally it is a matter of creating a token on the server that is moved to the client to establish the connection. We recently spent a couple days automating this process with ansible and it is very tricky and not really worth it for only a couple connections. The main use case where you need automation is when you have autoscaling groups where the connection needs to be established automatically on scale. For this we built an API to establish the connection automatically but have not open sourced it yet.

Do you have any plan to dualize the WireGuard server for the HA (High Availability)?

Yes, the plan is to run multiple wireguard / sentry instances in each region.

Who is running the WireGuard server? Can you dockerize it?

We would operate the server. Also we haven’t run wireguard in docker yet but it definitely would be possible.

Load Balanced Endpoints
- Are Consul and ElasticSearch used for health checks?

Generally we only use Consul for service discovery and build our health checks for JSON-RPC endpoints with a tool called health-checker. This can be dockerized and run as a side-car container or as an ansible role and pushed to galaxy as we have done in the past for polkadot. We don’t use elasticsearch in this capacity at this time.

What’s the plan to do health check for the nodes under the load balancer?

If you agree, we would do the same as we describe above. The AWS load balancer would then be hooked up to this health check. When we built this before, we checked the block height of a reference node to determine whether the node was fully synced. We could insert any logic needed to validate the sync state.
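A minimal Python sketch of the block height comparison described above. This is not the health-checker tool itself; the lag threshold `max_lag` and the function names are assumptions for illustration, and fetching the two heights over JSON-RPC is left to the caller.

```python
def is_synced(node_height, reference_height, max_lag=5):
    """Treat the node as synced if it is at most max_lag blocks behind the reference.

    A height of None means the corresponding node could not be queried.
    """
    if node_height is None or reference_height is None:
        return False
    return reference_height - node_height <= max_lag


def health_status(node_height, reference_height, max_lag=5):
    """Map the sync check to an HTTP status a load balancer health check can use."""
    return 200 if is_synced(node_height, reference_height, max_lag) else 503
```

The load balancer would poll an endpoint that returns this status, so a node that falls behind is taken out of rotation until it catches up.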

It’s very difficult to do caching using the CloudFlare in terms of endpoint operation. Even if CloudFlare does GeoRouting, actually many physical nodes are required in the proxy method - do you have any alternative solution for that?

You are absolutely correct that caching is very difficult. As we see it, there is no simple / good solution around. We don’t think a classic cache with a generic replacement policy will work well apart from some of the most common requests but those are very few. We also haven’t looked too far into cloudflare as an option though it seems like it would be incredibly difficult to do.

We think the best way to operate a cache is if you inspect the rpc method and route the request to the appropriate cache. This is possible with nginx and some custom lua scripting but gets hairy really fast. No popular API gateways support json-rpc transcoding as far as we’re aware. Infura built a custom closed source reverse proxy called Ferryman to do this and Parity open sourced one called jsonrpc-proxy that we think will work well for this purpose.
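To make the idea of inspecting the rpc method concrete, here is a small Python sketch of the routing decision only. It is not how Ferryman or jsonrpc-proxy work internally; the method groupings and the route labels are assumptions chosen for illustration, and batch requests are simply rejected.

```python
import json

# Illustrative groupings of ICON JSON-RPC methods; the real split would be
# driven by observed traffic.
WRITE_METHODS = {"icx_sendTransaction"}
IMMUTABLE_METHODS = {"icx_getBlockByHeight", "icx_getBlockByHash", "icx_getTransactionByHash"}
NEAR_HEAD_METHODS = {"icx_getLastBlock", "icx_getBalance", "icx_getTotalSupply"}


def route(raw_body):
    """Pick a hypothetical upstream or cache tier based on the JSON-RPC method."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError:
        return "reject"
    if not isinstance(payload, dict):
        return "reject"  # batch requests are out of scope for this sketch
    method = payload.get("method")
    if method in WRITE_METHODS:
        return "write_upstream"
    if method in IMMUTABLE_METHODS:
        return "long_cache"
    if method in NEAR_HEAD_METHODS:
        return "short_cache"
    return "read_upstream"
```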

How the methods are proxied from there is then a question of which requests are most common. Infura operates what it calls a near-head cache, where it caches the most recent data, which then expires over time. This was in response to what they saw as their most requested queries, and the same pattern likely holds for ICON. Infura has then built a number of microservices to maintain those caches, relying on a block indexer to inform the expiration policies of the cache.
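A simplified Python sketch of a cache whose entries expire over time. Note that this uses a plain time-to-live rather than the block indexer driven expiration described for Infura, and the class name `NearHeadCache` and its interface are assumptions for illustration.

```python
import time


class NearHeadCache:
    """Tiny in-memory cache whose entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return value
```

A block-aware version would replace the time check with a comparison against the latest indexed block height.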

Going this route would be a big commitment and we’d think it would need to be a community wide effort to support. We can lay out more details of our proposed caching implementation in another grant or as an extension to this one. It would be some simple solution based on the jsonrpc-proxy with a couple supporting microservices to handle a subset of queries.

Apologies for the long-winded response, but at this stage it is better to err on the side of verbosity than to lack detail. Let us know what you think and we’ll follow up with an updated grant application and a bottom-up list of tasks.

ICON_ADMIN · October 29, 2020, 4:08pm

Hi team,

We decided to accept this proposal to have an additional endpoint powered by the community. Before we start this project, I’d like to suggest the below work scope and requirement. Since we’re planning to migrate the current blockchain to ICON 2.0, the network structure can change. Therefore, please suggest a modified proposal according to the requirements below.

Requirement

Supports routing and micro caching by decomposing JSON-RPC payload
- Provides JSON-RPC reverse proxy server (Nginx + LUA plugin, Envoy, Golang, etc…)
- Separation according to read/write method
- Develop the structure to support the micro caching (1 second)
- Implement IP based ACL
- Support dynamic ACL based on the validator list (P-Rep)
- Support web-socket
- Implement health checking logic for upstream
Distributed service environment based on Geo-Routing (composed of more than 3 continents)
Ensure the response speed of less than 100ms in any region
Dockerize it so that community can easily use it
Must be scalable and capable of real-time analysis of JSON-RPC statistics by method (getBalance, icx_sendTransaction, getDelegation, etc…)
Suggest a redundancy plan for HA
Suggest a plan to respond to DDoS attacks
Suggest a plan to test the service
Opensource this project
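As one possible reading of the IP based and dynamic ACL requirements above, here is a minimal Python sketch. The class name `DynamicAcl`, its methods, and the idea of refreshing the validator set from the P-Rep list are assumptions for illustration, not a specification.

```python
import ipaddress


class DynamicAcl:
    """Allow clients from static CIDR ranges plus a refreshable set of validator IPs."""

    def __init__(self, static_networks):
        self._static = [ipaddress.ip_network(net) for net in static_networks]
        self._validators = set()

    def update_validators(self, ips):
        """Replace the dynamic part of the ACL, e.g. after the validator list changes."""
        self._validators = {ipaddress.ip_address(ip) for ip in ips}

    def allows(self, client_ip):
        try:
            addr = ipaddress.ip_address(client_ip)
        except ValueError:
            return False  # not a valid IP literal
        if addr in self._validators:
            return True
        return any(addr in net for net in self._static)
```

The reverse proxy would call `allows` per connection and call `update_validators` on a timer or whenever the validator list changes.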

Please check the above requirements and submit the modified proposal with the entire architecture diagram.

robcio · October 29, 2020, 8:42pm

Thanks @ICON_ADMIN. We much appreciate the feedback and will circle back with those details and a full cost breakdown of our intended work items. This work will likely be staggered with some of the work for the eventeum equivalent service and graphql API, which will affect some of the caching strategies.

Will follow up soon and thank you very much for this opportunity. Excited to make this happen.