NOTE: This proposal still has items, shown in brackets, that are awaiting community feedback.
1) Title
Private Load Balanced Endpoint Backbone
2) Project Category
Infrastructure: for supporting the underlying code base of the blockchain - Infrastructure supporting tools, bug patches, node maintenance tools, etc.
3) Project Description
As was outlined in our forum post, ICON’s current centralized endpoint infrastructure presents a critical single point of failure for the entire network.
Having a centralized API endpoint that numerous parties rely on (e.g. ctz.solidwallet.io) is a mission-critical risk. As seen, if this endpoint goes down, it results in a cascading set of problems that impact the whole network. This grant aims to mitigate these problems.
As we see it now (awaiting further discussions with the Foundation), the core of the problem is the central solidwallet endpoint, which all the nodes on the network sync off of by default. If it goes down, all the sub-preps go down. If enough main preps go down after that, the network will stop functioning, because they also sync off of solidwallet by default. Even if a main prep maintains its own citizen node, the main preps would need to coordinate so that they sync off each other, which in and of itself creates additional problems.
What we believe is needed is a private endpoint that syncs off several volunteer main preps and serves as a common sync point for the rest of the critical infrastructure. If we focus on making that common sync point as robust and fault tolerant as possible, all the subsequent API infrastructure will be able to sync off it. It is important that this sync point is private so that attackers are unable to access it. We could also potentially make the REST API port 9000 private for main preps, as the rest of the network will no longer need access to that port.
After hardening the sync point, further decentralization of the API interface can proceed, particularly in cooperation with Pokt network, which appears to have a solid solution for converting the sub-preps into load balanced API nodes that are further compensated for operating a node on the network. These nodes could then take over the load of solidwallet as we prove out its stability on the network. Regardless of whether the network fully embraces Pokt as an API provider, having a private network would be critical in making the network more stable.
This type of architecture is not uncommon in other top performing ecosystems. Cosmos, for instance, employs a private sentry layer where nodes are connected to validators over a secure VPN connection. We have implemented such a setup with WireGuard for other ecosystems, and this grant lays the groundwork for a follow-on grant to implement this setup. While Cosmos’ networking protocols are different from ICON’s, the protection layer serves, for practical purposes, the same function.
All development of the infrastructure will leverage prior work done for the Polkadot network, where we built a set of load balanced endpoints on each of the major cloud providers. Auto-scaling behavior is a critical consideration: baseline operations should be relatively lean, but the endpoints need to be able to scale to meet whatever demand arises. Crucially, scaling needs to complete in a reasonable amount of time to be effective. Currently, sync time is roughly 2 hours, which can be reduced to under 15 minutes if handled by a custom content delivery network that is updated on a regular basis by a source of truth node. We have already implemented this methodology in Polkadot and were able to bring the scaling time down to 5 minutes, which makes scaling simple to manage. We intend to implement this setup ourselves and could operate it in the long term so that all nodes can sync and recover faster.
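To illustrate why sync time drives scaling behavior, the sketch below estimates how early a scale-out has to be triggered so that a new node is ready before the existing capacity runs out. The 120, 15, and 5 minute figures come from the paragraph above. The capacity and traffic-growth numbers, the function name, and the linear-growth model are all hypothetical assumptions for illustration, not measurements.

```python
def trigger_utilisation(capacity_rps: float,
                        growth_rps_per_min: float,
                        ready_minutes: float) -> float:
    """Utilisation (0..1) at which a scale-out must start.

    Assumes traffic grows linearly while the new node is syncing, so the
    cluster must keep `growth * ready_minutes` requests/s of headroom.
    """
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    headroom_rps = growth_rps_per_min * ready_minutes
    # Clamp at 0: a value of 0 means the cluster can never scale in time.
    return max(0.0, 1.0 - headroom_rps / capacity_rps)


# Hypothetical cluster: 1000 requests/s capacity, load growing 5 requests/s
# every minute during an incident.
for minutes in (120, 15, 5):
    threshold = trigger_utilisation(1000, 5, minutes)
    print(f"ready in {minutes:>3} min -> scale out at {threshold:.1%} utilisation")
```

Under these assumed numbers, a 2 hour sync forces scale-out at about 40% utilisation, which means paying for mostly idle capacity, while a 15 or 5 minute sync lets the cluster run lean at roughly 92.5% or 97.5% before reacting.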
While the proposal is aimed at mitigating the specific threat vectors outlined above, building it with automated deployment tooling will also let us offer the community a one-click deployment of citizen nodes for DApps, thereby decreasing the reliance on the public solidwallet endpoint. This type of tooling will come with minimal extra effort when executing this grant. All components will be built with a combination of Terraform, Ansible, and Helm, and fronted with a CLI to walk users through the deployment process.
4) Project Duration
4 Months
- Development over 4 months
- Operation for 12 months, starting as soon as development is complete
5) Project Milestones
Phase 1 - Infrastructure as code modules - 2 Months
- Develop Terraform modules for AWS, GCP, and Cloudflare with continuous integration testing
- 2 repos for auto-scaling group modules, scaling policies, and load balancer (AWS + GCP)
- Terraform + Packer + Ansible
- 1 or 2 repos for agents to update firewalls on service discovery of P-Rep’s node IP via API
- Exact firewall implementation still under consideration.
- 2 repos for Kubernetes clusters (AWS + GCP)
- 1 repo for cloud-init scripts
- 1 repo for library node and CDN for fast scale and sync capabilities (AWS)
- Current sync time is roughly 2 hours. We aim to bring sync time down to less than 15 minutes to support effective autoscaling capabilities
- 1 repo for Helm charts with Nginx, cert manager, Prometheus, Grafana, Alertmanager, Consul, and Elasticsearch
- Alarms on each instance on important metrics
- 1 repo for Cloudflare routing and load balancer
Most of phase 1 is relatively off the shelf except for the L4 firewall, which needs to be custom made so that it responds to peer discovery on the network. Because of the mission critical nature of this infrastructure, extra care needs to be taken to harden it.
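As a rough sketch of what the firewall agent's core step might look like, the function below compares the IPs currently allowed through the firewall with the P-Rep node IPs found by service discovery and returns the rules to add and remove. The function name and the example addresses are hypothetical. How peers are discovered and how rules are applied to a given cloud firewall are deliberately left out, since the exact implementation is still under consideration.

```python
import ipaddress


def plan_firewall_changes(allowed, discovered):
    """Return (to_add, to_remove) so the allow-list matches discovered peers.

    Both arguments are iterables of IP address strings. Invalid addresses
    raise ValueError via the ipaddress module.
    """
    current = {ipaddress.ip_address(ip) for ip in allowed}
    wanted = {ipaddress.ip_address(ip) for ip in discovered}

    # Safety guard (an assumption of this sketch): an empty discovery result
    # is more likely an API failure than a real state, so never lock out
    # every peer because of it.
    if not wanted:
        raise ValueError("discovery returned no peers; refusing to change rules")

    def order(ip):
        # Sort key that works for mixed IPv4 and IPv6 sets.
        return (ip.version, int(ip))

    to_add = [str(ip) for ip in sorted(wanted - current, key=order)]
    to_remove = [str(ip) for ip in sorted(current - wanted, key=order)]
    return to_add, to_remove


# Example with documentation-range addresses (not real P-Rep nodes).
to_add, to_remove = plan_firewall_changes(
    allowed=["192.0.2.10", "192.0.2.11"],
    discovered=["192.0.2.11", "198.51.100.7"],
)
print("allow:", to_add)      # allow: ['198.51.100.7']
print("revoke:", to_remove)  # revoke: ['192.0.2.10']
```

Computing a diff rather than rewriting the whole rule set keeps each change small and auditable, which matters for infrastructure that the rest of the network syncs from.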
Phase 2 - Testing and optimization - 1 Month
- Deploy solution on testnet
- Deploy exact configuration as expected on mainnet
- Load testing of endpoints with Locust to determine proper scaling policies
- Report findings and recommendations for securely operated node topologies
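One way the load-test results could feed into a scaling policy is sketched below. It takes per-node throughput and latency measurements of the kind a Locust run reports, finds the highest request rate a single node sustains within a latency budget, and derives a minimum node count for a target peak. All numbers, the 200 ms budget, and the function names are hypothetical placeholders, not results from ICON nodes.

```python
import math

# Hypothetical load-test steps for one node: (requests/s, p95 latency in ms).
SAMPLES = [(50, 40), (100, 55), (200, 90), (400, 310), (800, 1500)]


def max_sustainable_rps(samples, p95_limit_ms):
    """Highest tested rate before the first step that exceeds the latency budget."""
    best = None
    for rps, p95_ms in sorted(samples):
        if p95_ms > p95_limit_ms:
            break  # stop at the first violation; higher steps are not trusted
        best = rps
    if best is None:
        raise ValueError("no tested rate meets the latency budget")
    return best


def min_nodes(peak_rps, per_node_rps, spare_nodes=1):
    """Nodes needed to serve peak_rps, plus spares to survive a node failure."""
    return math.ceil(peak_rps / per_node_rps) + spare_nodes


per_node = max_sustainable_rps(SAMPLES, p95_limit_ms=200)
print("sustainable requests/s per node:", per_node)       # 200
print("nodes for a 1000 requests/s peak:", min_nodes(1000, per_node))  # 6
```

The spare-node term is a design choice of this sketch: it keeps the cluster within its latency budget even while one node is down or resyncing.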
Phase 3 - Integration - 1 Month
- Integration into current one-click deployment solution to allow anyone to spin up their own load balanced API endpoints within minutes
- Contribution to prep_docker to make the endpoint the default option
Phase 4 - Deployment and Operations - 12 Months
- Deploy [1] cluster on both AWS and GCP and perform geo-routing through Cloudflare
- Operate clusters with at least [3] nodes on NVMe-backed VMs with [4] cores each and a scalable routing layer
- Geo-routing through Cloudflare
6) Funding Amount Requested
The requested amount falls into two budgetary categories: development and operations. For development, we are requesting [X$ USD]. For operations, we are requesting a baseline operations fee of [X$ USD / month] to cover expected AWS and management costs, plus at-cost accounting for any additional scaling that is needed.
7) Official Team Name
Insight
8) Team Contact Information (e.g. email, telegram id, social media)
Email: insighticon.prep@gmail.com
Telegram ID: @robcio
Social Media: twitter.com/icon_insight
GitHub: github.com/insight-icon
9) Public Address (To receive the grant)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
hxd278152d68a34baff1492d9abaf652fa02cbd088
-----BEGIN PGP SIGNATURE-----
iQIzBAEBCAAdFiEEfeqitHBQR7zWx7uGhofO4QrZt98FAl9JQHsACgkQhofO4QrZ
t98XXhAAlLPoKISf9LJMtCUS6flV1kpNqJb3iPvfGObNZqYpwcwYSeRObszgUz5b
aqOwz1JzDnrW0RrS/S9KGPawkT7byH16iu/c0EpU7vmgs/hSNmXLxbjdKHt/uEqs
pLjQ1YqmnglO7kNHYTD3bSe+uX1kngw9KaVLUzn1jRmVMB3avqf5PrFuYRry+elj
DdBPpg9KyLbmadFU3axtYLC25uA6n0LsbBbBMD8NZHzYzcOzz82t89iF+ws/iKy7
9oLdEKNAXv24PSlrX06dv7+zwyEyKnG+4h5KNz+MnsZlkcoweOCRsU+E71ZSwa5h
6ZjZ5VxWINuB5lvCHT4H1MkDFlt4OP2uTqFNVsxS7MNQQg8OFGvIJHooFomyoNL8
MIV/pAx1aPfYL3i6YN+usdxBCFdoClw69aGFPslusSzutXNvhrDW/j9cUhKZ3Xsh
qph8Ywj+tvKEPZOeLAh3B4+2KEUBcoyS+AfBfiPOo+5vLpo4C0xHmmsnIhVhtuiL
3NfqCfmXw5XXgXDvGF56VJuP455kbbcGfrGIOF6eI54FqDs7CQlEyFy4OqGUcgZD
vS01hDN5xxWWE1o/W2Mnme2cHAQNP6lYWelONuDwxhVFJh6j3Sysi5igMbTOtl05
kXoRdelS6pKOgc23nrH9tjlL9ktAF2kMFHUy/kvHS2bxeh11Oo8=
=5PUl
-----END PGP SIGNATURE-----