[Grant Application] Autoscaling Endpoints V1

robcio · November 19, 2020, 7:27pm

1) Title

Autoscaling Endpoints V1

2) Project Category

Infrastructure: for supporting the underlying code base of the blockchain - Infrastructure supporting tools, bug patches, node maintenance tools, etc.

3) Project Description

This project aims to build a globally distributed set of geo-routed, autoscaling endpoints for the ICON blockchain. It will be defined entirely as code so that new users can easily adopt the architecture or modify it to suit their needs. It is the first in a series of grants aimed at upgrading the API infrastructure supporting ICON and handing off support of the network’s endpoints to the P-Rep community. As such, two versions will be developed, both open source but with different objectives. The first is a VM-based solution that any community member can easily deploy and operate. The second is a more robust Kubernetes-based deployment for the main endpoints and experienced operators.

The community deployment will extend the existing automated P-Rep deployment and come in two flavors: single-node and multi-node. The single-node deployment is based on an Ansible role that configures the node and is essentially done, with only minor modifications needed. The multi-node deployment needs Packer integrated with a load balancer, along with a custom health check to inform routing. Both deployments will be provisioned with Terraform on AWS, with options to extend to GCP or other clouds that offer matching features, on community request.

The main endpoints will be deployed primarily on Kubernetes and will leverage the autoscaling groups developed in the community modules for application nodes. Three geographically distributed clusters will be deployed and georouted through Cloudflare, with a goal of achieving a response time under 100ms. A fast sync agent will be deployed so that the cluster can scale as quickly as possible.
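The 100ms goal above does not say how it would be measured. One way to make it checkable, assuming it is evaluated as a 95th-percentile latency over sampled requests (an assumption, not something stated in this proposal), is sketched below. The function names and the nearest-rank percentile method are illustrative choices only.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_goal(samples_ms, goal_ms=100.0, pct=95):
    """True when the chosen percentile is under the goal (assumed: p95 < 100ms)."""
    return percentile(samples_ms, pct) < goal_ms
```

A single slow outlier in a small sample is enough to fail a p95 check, so the percentile and sample window would need to be agreed on before using this as an acceptance criterion.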

All deployments will include sidecar Prometheus monitoring exporters and logging agents. Integration tests will be built with Terratest and run in CI. Endpoints will later be extended to include websockets with ICON 2.0 and any additional APIs built by the community, as laid out in the adjacent stream data processing grant.

4) Project Duration

Development over 3 months
Operation for 3 months as soon as development is finished

5) Project Milestones

All milestones are built from a bottom-up estimate of tasks. All items are open to feedback on priority, with the associated trade-offs in the risks of running the endpoints.

Phase 1 - Community Modules - 3 Weeks

Initial development will focus on a set of robust community modules to deploy both single nodes and clustered environments. They will be used later in the main deployment due to their ability to scale with locally attached NVMe storage.

When operating autoscaling groups, it is important to minimize scaling time. Currently, fast sync options take up to 2 hours from node bootstrap, largely due to how the whole blockchain is zipped up. As the chain grows, this time will grow. It also forces node operators to use volumes 75% larger than the actual chain data due to downloads hitting the buffer, and adds time to unzip the content. To minimize this, we will build our own sync agent to take incremental uncompressed backups of the blockchain and store them in an S3 bucket within the region of the cluster. A sync client sidecar container will be used initially, with the code eventually integrated into the main container distribution as a fast sync option, subject to Foundation approval. We are aiming to bring scaling time down to under 10 minutes.
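To illustrate what "incremental" could mean here, the sketch below compares a manifest of local chain files against a manifest of what is already in the bucket and uploads only the differences. It is a minimal sketch under stated assumptions: the per-file SHA-256 manifest, the function names, and the omission of the actual S3 calls are all illustrative, not the planned agent design.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large chain files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's path (relative to data_dir) to its content digest."""
    return {
        path.relative_to(data_dir).as_posix(): file_digest(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def plan_incremental_sync(
    local: dict[str, str], remote: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Return (files to upload, files to delete from the backup).

    A file is uploaded when it is new or its digest changed; it is deleted
    remotely when it no longer exists locally.
    """
    upload = sorted(name for name, digest in local.items() if remote.get(name) != digest)
    delete = sorted(name for name in remote if name not in local)
    return upload, delete
```

Because the files are stored uncompressed, a new node can download them straight into place without extra volume space for an archive, which is the motivation given above.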

The deployment will generally consist of a single Terraform module that calls an Ansible role published to Galaxy. This role then pulls a repo containing a docker-compose file and renames the appropriate override files to run the application in the desired configuration. Users will be able to use any component of the stack independently (i.e., run the Ansible role against any desired host), though the Terraform module will be suitable for most users.
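The override step relies on Docker Compose automatically merging a file named docker-compose.override.yml with the base docker-compose.yml. A minimal sketch of selecting a configuration is shown below. The docker-compose.<mode>.yml naming scheme and the function name are hypothetical, and the sketch copies the file rather than renaming it so the original stays in the repo; the real role would do this with Ansible tasks.

```python
import shutil
from pathlib import Path


def activate_override(repo_dir: Path, mode: str) -> Path:
    """Make docker-compose.<mode>.yml the active override file.

    The naming scheme is an assumption for illustration only.
    """
    source = repo_dir / f"docker-compose.{mode}.yml"
    if not source.is_file():
        raise FileNotFoundError(f"no override file for mode {mode!r}: {source}")
    # Compose picks this file up automatically next to docker-compose.yml.
    target = repo_dir / "docker-compose.override.yml"
    shutil.copyfile(source, target)
    return target
```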

Docker Compose Stacks
- Repo with base application and override files
- Sync client and agent (source of truth node)
Ansible Roles
- Main application (updates)
- Custom health check
- Source of truth node
AWS Modules
- Single node module
  - SSL with certbot
- Autoscaling group module
  - Network load balancer with SSL tied to domain in Route53
GCP Modules (Optional - Add $2,000 to project)
- Same components as the AWS deployment
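The custom health check listed under Ansible Roles could, for example, mark a node healthy only when it is close to the chain tip. The sketch below is one possible shape, assuming the check compares the node's block height against a reference height (for instance from the source of truth node). The function name and the `max_lag` default are hypothetical.

```python
def health_status(node_height: int, reference_height: int, max_lag: int = 10) -> int:
    """Return the HTTP status a health endpoint would report to the load balancer.

    200 keeps the node in rotation; 503 takes it out until it catches up.
    """
    if node_height < 0 or reference_height < 0:
        raise ValueError("block heights must be non-negative")
    lag = reference_height - node_height
    # A node ahead of the reference (negative lag) is treated as in sync.
    return 200 if lag <= max_lag else 503
```

A freshly launched node in the autoscaling group would report 503 until the sync client has caught it up, so the load balancer never routes requests to a node serving stale data.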

Phase 2 - Main Deployment Modules - 1 Month

The second phase will focus on the elements needed to bring the endpoints into production, including logging, monitoring, and alerts. For logging, an ELK stack will be used to show the status code and request type (method) for all requests. For monitoring, a Prometheus stack will be used with a custom exporter to get metrics from the node and application. Alerts will then be built from Elasticsearch queries using Elastalert and from metrics using Alertmanager, then forwarded to Slack and Telegram channels. Grafana and Kibana dashboards for each region will be available.
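As an illustration of the kind of rule such alerts might encode, the sketch below counts requests by method and status code and flags a window whose share of 5xx responses exceeds a threshold. The record shape, the 5% threshold, and the function names are assumptions for illustration; the real rules would be written as Elastalert and Alertmanager configurations rather than Python.

```python
from collections import Counter


def summarize_requests(records):
    """Count requests per (method, status code) pair."""
    return Counter((record["method"], record["status"]) for record in records)


def error_rate(records):
    """Fraction of requests in the window that returned a 5xx status."""
    if not records:
        return 0.0
    errors = sum(1 for record in records if 500 <= record["status"] <= 599)
    return errors / len(records)


def should_alert(records, threshold=0.05):
    """True when the 5xx share of the window exceeds the threshold."""
    return error_rate(records) > threshold
```

For example, a window of 18 successful requests and 2 responses with status 503 gives an error rate of 0.1, which would trip the assumed 5% threshold.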

The deployment will be built on Kubernetes and will route traffic either down to the citizen nodes from the community module or to pods running inside the cluster. The decision will be based on tests of local volumes in Kubernetes, which is still in active development, though ultimately the goal is to run entirely within Kubernetes using the P-Rep application Helm chart we developed. Using Kubernetes will allow us to more easily integrate production features as they become available, such as websockets and an L4 firewall for a private sync point.

On completion, we will run a single region endpoint and offer it to the community.

Kubernetes cluster (AWS will be primary development - GCP on request)
Kubernetes baseline setup
- Service accounts
- Prometheus operator
- Logging / Elasticsearch operators
- Consul (if using ASG)
Fluentd and Elasticsearch configurations
Grafana and Kibana dashboards
Alertmanager and Elastalert alert configurations

Phase 3 - Master Cluster Deployment - 2 weeks

To achieve global redundancy and minimize request latency, three clusters (US, Europe, Asia) will be deployed. Cloudflare will be used for georouting, with aggregated health checks bubbling up from each region to inform routing. A single master cluster will be developed to take high-level metrics from each region and perform various cross-cluster Elasticsearch queries to build a complete map of the use and health of the endpoints. A status page will be built to display this information publicly. Lastly, a load test will be performed to understand thresholds for use and scaling in the event of a DoS attack.

Status page
Alerta feeding status page and distribution of alerts
Load test with Locust
Multi-region production deployment
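The aggregated health checks described for this phase can be pictured as a two-step decision: roll node-level checks up into one verdict per region, then route to the closest region that is still healthy. The sketch below shows that logic only; Cloudflare would perform the actual georouting, and the 50% healthy-node threshold, the region keys, and the function names are assumptions.

```python
from __future__ import annotations


def region_is_healthy(node_statuses: list[bool], min_healthy_fraction: float = 0.5) -> bool:
    """Aggregate node-level checks into a single regional verdict."""
    if not node_statuses:
        return False  # a region with no reporting nodes cannot serve traffic
    return sum(node_statuses) / len(node_statuses) >= min_healthy_fraction


def pick_region(latency_ms: dict[str, float], health: dict[str, list[bool]]) -> str | None:
    """Choose the lowest-latency region whose aggregated health check passes."""
    healthy = [region for region in latency_ms if region_is_healthy(health.get(region, []))]
    if not healthy:
        return None
    return min(healthy, key=lambda region: latency_ms[region])
```

With this rule, a client nearest to Europe would be sent to the US cluster if most European nodes were failing their checks, trading some latency for availability.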

Phase 4 - Operations - 3 months

Once developed, the endpoints will be operated for 3 months with continued operational expenses priced at cost.

6) Funding Amount Requested

The requested amount falls into two budgetary categories: development and operations. For development, we are requesting $25,000 USD, with options to extend based on GCP inclusion ($2,000) and feature requests. For operations, we are requesting a baseline operations fee of $1,500 USD per month plus at-cost expenses (estimated at ~$2,000 per month), to be documented in a report. All improvements made to the cluster will be integrated into the main codebase so that best practices can be shared and adopted by the community.

7) Official Team Name

Insight

8) Team Contact Information (e.g. email, telegram id, social media)

Email: insighticon.prep@gmail.com

Telegram ID: @robcio

Social Media: twitter.com/icon_insight

GitHub: github.com/insight-icon

9) Public Address (To receive the grant)

-----BEGIN PGP SIGNED MESSAGE-----

Hash: SHA256

hxd278152d68a34baff1492d9abaf652fa02cbd088

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEfeqitHBQR7zWx7uGhofO4QrZt98FAl9JQHsACgkQhofO4QrZ

t98XXhAAlLPoKISf9LJMtCUS6flV1kpNqJb3iPvfGObNZqYpwcwYSeRObszgUz5b

aqOwz1JzDnrW0RrS/S9KGPawkT7byH16iu/c0EpU7vmgs/hSNmXLxbjdKHt/uEqs

pLjQ1YqmnglO7kNHYTD3bSe+uX1kngw9KaVLUzn1jRmVMB3avqf5PrFuYRry+elj

DdBPpg9KyLbmadFU3axtYLC25uA6n0LsbBbBMD8NZHzYzcOzz82t89iF+ws/iKy7

9oLdEKNAXv24PSlrX06dv7+zwyEyKnG+4h5KNz+MnsZlkcoweOCRsU+E71ZSwa5h

6ZjZ5VxWINuB5lvCHT4H1MkDFlt4OP2uTqFNVsxS7MNQQg8OFGvIJHooFomyoNL8

MIV/pAx1aPfYL3i6YN+usdxBCFdoClw69aGFPslusSzutXNvhrDW/j9cUhKZ3Xsh

qph8Ywj+tvKEPZOeLAh3B4+2KEUBcoyS+AfBfiPOo+5vLpo4C0xHmmsnIhVhtuiL

3NfqCfmXw5XXgXDvGF56VJuP455kbbcGfrGIOF6eI54FqDs7CQlEyFy4OqGUcgZD

vS01hDN5xxWWE1o/W2Mnme2cHAQNP6lYWelONuDwxhVFJh6j3Sysi5igMbTOtl05

kXoRdelS6pKOgc23nrH9tjlL9ktAF2kMFHUy/kvHS2bxeh11Oo8=

=5PUl

-----END PGP SIGNATURE-----

ICON_ADMIN · December 8, 2020, 6:07am

Initial Review Result Comments

Review Result

Approve

Review Comments

Can you add a more detailed description for each task?

Next Procedure

The ICON Foundation has provided the Initial Review Result Comments on the proposal. Grant recipients should respond to the Foundation’s comments. Depending on your response, the Foundation will provide the Final Review Result Comments.

In addition, all grant recipients must complete KYC in order to receive grants from the Foundation. Once your proposal is fully accepted, we’ll guide you through the KYC procedure.

robcio · December 12, 2020, 3:22am

Hi @ICON_ADMIN - I have updated the task list with more details per your request. We’ve also made some small adjustments to the planned alerting topology, with some improvements we are bringing over from another project.

Let me know if there is any other information, otherwise we’ll be starting on the first milestones shortly.