1) Title
Blockchain Analytics and Data Pipelines
2) Project Category
Development: Building and hosting the analytics stack to analyze network adoption and smart contract activity.
3) Project Description
To track the growth and adoption of ICON, metrics need to be extracted, transformed, and loaded into performant databases for analysis. When the process is built right, insights can be gained in near real time on items such as monthly active wallets or the number of transactions for individual smart contracts and DApps over time. By analyzing the blockchain's adoption, the community can gain insights into user behavior and act on them by directing resources more appropriately to increase transactions.
This proposal focuses on building a robust business analytics tool suite to support blockchain analytics. It will use the same architecture employed by the top blockchains per the blockchain-etl GitHub organization: orchestrating data pipelines with Apache Airflow and loading the data into relational databases for further analysis. The project will then build dashboards with Apache Superset for analysis, with long term support for additional metrics requested by the community. These visualizations can then be embedded into various block explorers or used by individuals doing exploratory analysis.
4) Project Duration
10 weeks
5) Project Milestones
Initially, the project will focus on high grading metrics for investigation and standing up the infrastructure necessary for analysis. These two processes will be done in parallel and will take about two weeks to complete. Next, we will construct data pipelines, base tables, and query-optimized tables. These databases will then be connected to a business intelligence dashboarding system to aid in exploratory analysis. After putting the process into production, we will package it up so that Insight data engineering fellows and members of the community can build additional metrics and visualizations.
Phase 1: Metrics High Grading
Milestones:
- Canvass community and foundation for a high grading of metrics to be collected
Deliverables:
- Consolidated list of metrics and associated tables needed to feed analytics
- DDL for initial SQL schema design and SQLAlchemy object model
Phase 2: Infrastructure Deployment
Note: this phase will be executed in parallel with phase 1
Milestones:
- Stand up infrastructure in a pattern that can support multiple environments
- Build high throughput architecture for delivering analytics
- Selection of long term storage options and short term query optimized solutions
Deliverables:
- Terraform and Ansible to stand up Airflow, workers, OLTP DB, OLAP DB, and business intelligence dashboarding tools with automation
Phase 3: Data Pipelines
Milestones:
- Chain parsers and data pipelines feeding intermediary tables and data warehouse
- Database tuning and index optimization for high fidelity exploratory queries
Deliverables:
- A collection of Airflow DAGs to construct data pipelines
- Scheduled jobs to build reports and analysis tables
Phase 4: Data Visualization
Milestones:
- Build dashboards to support high-graded metrics from initial community canvassing
Deliverables:
- Deployment of Superset business analytics tooling, with support for Tableau integration
- Visualizations that can be wrapped in iframes and embedded into various block explorers and other tools the community uses to visualize metrics
6) Funding Amount Requested
$20,000
This covers all costs:
- All infrastructure expenses incurred during development
- Note: Operation of a high performance data warehouse can be expensive. We will assume these costs during development while making efforts to keep long term operation of the data warehouse cost effective.
- Core team (Richard, Rob, and Mitchell)
- Additional code + documentation reviewers
- Upon completion of the grant, Insight will operate and maintain a public-facing long term storage database for the community for 3 months for evaluation, and will follow up with a cost plus accounting budget for long term operation
7) Official Team Name
Insight
8) Team Contact Information (e.g. email, telegram id, social media)
Email: insighticon.prep@gmail.com
Telegram ID: @robcio
Social Media: twitter.com/icon_insight
GitHub: github.com/insight-icon
9) Public Address
hxd278152d68a34baff1492d9abaf652fa02cbd088
Deep Dive
The primary goal of the project is to get data in the right format into a data warehouse that supports OLAP queries and hence can be connected to business intelligence software such as Superset or Tableau. To do this, we will need a robust data pipeline that feeds data into an OLTP database and then transforms it into an OLAP database. After the data is in the OLAP database, visualizations and dashboards can be created.
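As a concrete illustration, here is a minimal sketch of one such OLTP-to-OLAP step, materializing a monthly active wallets rollup. The table names, columns, and connection string are hypothetical placeholders; the real schema will come out of the Phase 1 metrics canvass.

```python
# Minimal sketch of an OLTP -> OLAP rollup step. All table/column names and
# the connection string are assumptions, not the final schema.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://analytics@localhost/icon")  # assumed DSN

MONTHLY_ACTIVE_WALLETS = text("""
    -- assumes a UNIQUE constraint on month in the target table
    INSERT INTO monthly_active_wallets (month, active_wallets)
    SELECT date_trunc('month', timestamp) AS month,
           count(DISTINCT from_address)   AS active_wallets
    FROM transactions
    GROUP BY 1
    ON CONFLICT (month) DO UPDATE
        SET active_wallets = EXCLUDED.active_wallets
""")

with engine.begin() as conn:  # one transaction per scheduled rollup run
    conn.execute(MONTHLY_ACTIVE_WALLETS)
```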
The most important piece of the infrastructure is the data pipeline. By using Apache Airflow, we will have a defined pattern that can scalably be expanded to support any number of additional metrics. Apache Airflow is the best-in-class tool for ETL (extract, transform, load) operations in support of business analytics and will be the primary workflow orchestrator for the data pipeline.
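To make the pattern concrete, the sketch below shows the shape of a typical extract-load-transform DAG. It assumes Airflow 2.x import paths, and the DAG name, task bodies, and schedule are placeholders rather than the final pipeline.

```python
# Sketch of the intended Airflow ETL pattern (Airflow 2.x imports assumed;
# task bodies are placeholders).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_blocks(**context):
    """Pull newly confirmed blocks and transactions from an ICON node."""
    ...

def load_oltp(**context):
    """Insert the parsed rows into the normalized Postgres (OLTP) tables."""
    ...

def build_olap_tables(**context):
    """Denormalize OLTP rows into the query-optimized analysis tables."""
    ...

with DAG(
    dag_id="icon_block_etl",           # hypothetical DAG name
    start_date=datetime(2020, 1, 1),
    schedule_interval="*/10 * * * *",  # poll for new blocks every 10 minutes
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_blocks", python_callable=extract_blocks)
    load = PythonOperator(task_id="load_oltp", python_callable=load_oltp)
    transform = PythonOperator(task_id="build_olap_tables", python_callable=build_olap_tables)

    extract >> load >> transform  # classic ETL ordering
```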
These patterns are intentionally built for flexibility and extensibility, allowing any blockchain developer or data engineer to expand functionality, including future work by Insight contributors. Over the last year, Insight has onboarded 10 volunteer DevOps contributors to work in the ICON ecosystem, and this analytics stack will create myriad opportunities to engage our data engineers as well.
To determine which metrics will be most helpful to the ecosystem, we will canvass the community to high grade which analytics should be prioritized. With target tables properly scoped, we will begin designing a parsing strategy to build base tables that will feed the actual tables built for analysis. Depending on the type of analysis, different types of databases will be evaluated. Generally, distributed columnar databases such as Redshift and BigQuery are optimized for this type of analysis but can be expensive to run in the long term. As such, we will initially focus on an OLTP database like Postgres as the long term storage option, using it until we find significant performance degradation. We expect to hit bottlenecks when dealing with live filtering of time series data and will have a better understanding of the best data warehousing solution once the volume of data can be assessed after denormalization and indexing.
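For a sense of what the base tables might look like, below is an illustrative SQLAlchemy model with the kind of composite indexes the time series filters are likely to need. Every column and index here is an assumption to be revised by the Phase 1 schema design.

```python
# Illustrative base-table model (SQLAlchemy 1.4+ import paths assumed).
# Columns and indexes are placeholders pending the Phase 1 DDL.
from sqlalchemy import BigInteger, Column, DateTime, Index, Numeric, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Transaction(Base):
    __tablename__ = "transactions"

    tx_hash = Column(String(66), primary_key=True)
    block_height = Column(BigInteger, nullable=False)
    timestamp = Column(DateTime, nullable=False)
    from_address = Column(String(42), nullable=False)
    to_address = Column(String(42), nullable=False)  # wallet or SCORE address
    value = Column(Numeric(38, 18))                  # ICX transferred

    # Composite indexes aimed at the time series filters we expect to hit
    # first (e.g. per-contract transaction counts over time).
    __table_args__ = (
        Index("ix_tx_to_address_ts", "to_address", "timestamp"),
        Index("ix_tx_from_address_ts", "from_address", "timestamp"),
    )
```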
After the tables are assembled and indexed, we will connect them to a business analytics tool set to display dashboards for analysis. Insight will focus on development with Superset but will provide various members of the community with direct access to the database, to which other business analytics tools such as Tableau or Spotfire can be connected. For embedding charts for public consumption, Superset will be the best tool for the job.
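As a rough sketch of the embedding path, a dashboard rendered in Superset's standalone mode can be wrapped in an iframe along the following lines; the hostname and URL layout are assumptions about the eventual deployment.

```python
# Sketch of generating an embeddable iframe for a Superset dashboard.
# The host and the standalone URL pattern are assumptions about our deployment.
SUPERSET_BASE = "https://analytics.example.org"  # hypothetical host

def dashboard_iframe(dashboard_id: int, height: int = 600) -> str:
    """Return an iframe tag pointing at a dashboard rendered without Superset chrome."""
    src = f"{SUPERSET_BASE}/superset/dashboard/{dashboard_id}/?standalone=true"
    return f'<iframe src="{src}" width="100%" height="{height}" frameborder="0"></iframe>'

print(dashboard_iframe(1))  # paste the output into a block explorer page
```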
After the pipelines are built with dashboards to display the initial metrics, we will actively recruit Insight data engineers to expand this analysis process to account for additional metrics requested by the community.
Links to Relevant Open Source Repositories:
- insight-infrastructure/terragrunt-analytics
- Reference architecture to be used for deploying analytics infrastructure
- insight-infrastructure/terraform-aws-superset
- Terraform module to spin up Superset
- insight-infrastructure/terraform-aws-ec2-airflow
- Terraform module to spin up Airflow
- insight-decentralized-consensus-lab/Unified-Blockchains
- Python toolkit for cross-chain data analysis; extracts multiple blockchain types to single standard SQL database
Core Team
- Lead Engineer, Developer in Residence: Richard Mah, PhD
Richard is a developer in residence at Insight who worked on ICON infrastructure design, specifically the development of advanced monitoring solutions for P-Rep nodes. He has deep experience at the intersection of advanced data analysis and high performance computing. During his PhD, he employed a wide variety of machine learning techniques to automate the analysis of clinical data with a microservices-based architecture on Kubernetes. Since then, he has become an Insight fellow and has been focusing on building automated deployments of various blockchain-related infrastructure components.
- Infrastructure Architect: Rob Cannon
Rob is Insight's infrastructure architect, focusing primarily on blockchain and data analytics infrastructure. Many in the community might know him for his efforts in building automated deployments of infrastructure used across the ICON space, from validator nodes to DApp testing platforms. Prior to Insight, he worked in the oil and gas industry, founding a data science startup and working for a major operator. Rob has deep experience with all the tools described in this proposal, which were core components of his last startup.
- Principal Investigator: Mitchell Krawiec-Thayer, PhD
Mitchell is Insight's Head of Research for Developers in Residence, and is currently coordinating teams contributing open-source tooling and analyses for Polkadot, Zcash, Monero, and ICON. Mitchell founded the Monero Archival Project in 2017 and has been working with blockchain data sources and analyses ever since. Specializations include blockchain data engineering, data products, empirical analysis, on & off-chain anomaly detection, and blockchain data visualization.