[Grant Application] Blockchain Analytics and Data Pipelines

robcio · June 18, 2020, 7:38pm

1) Title

Blockchain Analytics and Data Pipelines

2) Project Category

Development: Building and hosting the analytics stack to analyze network adoption and smart contract activity.

3) Project Description

To track the growth and adoption of ICON, metrics need to be extracted, transformed, and loaded into performant databases for analysis. When the process is built right, insights can be gained in near real time to track items such as monthly active wallets or number of transactions for individual smart contracts and DApps over time. By analyzing the blockchain’s adoption, the community can gain insights into user behavior and act on them by directing resources more appropriately to increase transactions.

This proposal focuses on building a robust business analytics tool suite to support blockchain analytics. It will use the same architecture being used on the top blockchains per the blockchain-etl github organization by orchestrating data pipelines with Apache Airflow and inserting the data into various relational databases for further analysis. The project will then build dashboards with Apache Superset for analysis with a long term support for additional metrics requested by the community. These visualizations can then be embedded into various block explorers or used by individuals doing exploratory analysis.

4) Project Duration

10 weeks

5) Project Milestones

Initially the project will focus on high grading metrics for investigation and standing up infrastructure necessary for analysis. These two processes will be done in parallel and will take about two weeks complete. Next we will construct data pipelines, base tables, and query optimized tables. These databases will then be connected to a business intelligence dashboarding system to aid in exploratory analysis. After putting the process in production, we will package it up so that Insight data engineering fellows and members of the community can build additional metrics and visualizations.

Phase 1: Metrics High Grading

Milestones:

Canvass community and foundation for a high grading of metrics to be collected

Deliverables:

Consolidated list of metrics and associated tables needed to feed analytics
DDL for initial SQL schema design and SQLAlchemy object model

Phase 2: Infrastructure Deployment

Note: this phase will be executed in parallel with phase 1

Milestones:

Get infrastructure up in a pattern that can support multiple environments
Build high throughput architecture for delivering analytics
Selection of long term storage options and short term query optimized solutions

Deliverables:

Terraform and Ansible to stand up Airflow, workers, OLTP DB, OLAP DB, and business intelligence dashboarding tools with automation

Phase 3: Data Pipelines

Milestones:

Chain parsers and data pipelines feeding intermediary tables and data warehouse
Database tuning and index optimization for high fidelity exploratory queries

Deliverables:

A collection of Airflow DAGs to construct data pipelines
Scheduled jobs to build reports and analysis tables

Phase 4: Data Visualization

Milestones:

Build dashboards to support high-graded metrics from initial community canvassing

Deliverables:

Deployment of Superset business analytics tooling with support for integration with Tableau
Visualizations that can be wrapped in iframes and embedded into various block explorers and other tools the community uses to visualize metrics

6) Funding Amount Requested

$20,000

This covers all costs:

All infrastructure expenses incurred during development
- Note: Operation of a high performance data warehouse can be expensive. We will assume these costs during development making efforts to keep long term operation of the data warehouse cost effective.
Core team (Richard, Rob, and Mitchell)
Additional code + documentation reviewers
Upon completion of the grant, Insight will operate and maintain a public-facing long term storage database the community for 3 months for evaluation and followup with a cost plus accounting budget for long term operation

7) Official Team Name

Insight

8) Team Contact Information (e.g. e-email, telegram id, social media)

Email: insighticon.prep@gmail.com

Telegram ID: @robcio

Social Media: twitter.com/icon_insight

GitHub: github.com/insight-icon

9) Public Address

hxd278152d68a34baff1492d9abaf652fa02cbd088

Deep Dive

The primary goal of the project is to get data in the right format into a data warehouse that supports OLAP queries and hence can be connected to a business intelligence software such as Superset or Tableau. To do this, we will need a robust data pipeline to feed data into a OLTP database and then transformed into a OLAP database. After the data is in the OLAP database, visualizations and dashboards can be created.

The most important piece of the infrastructure is the data pipeline. By using Apache Airflow, we will have defined pattern that can be scalably be expanded on to support any number of additional metrics. Apache Airflow is the best in class tool for performing ETL (extract transform load) type of operations to support business analytics type of operations and will be the primary workflow orchestrator for the data pipeline.

These patterns are intentionally built for flexibility and extensibility, allowing any blockchain developer or data engineer to expand functionality, including future work by Insight contributors. Over the last year, Insight has onboarded 10 volunteer DevOps contributors to work in the ICON ecosystem, and this analytics stack will create myriad opportunities to engage our data engineers as well.

To determine which metrics will be most helpful to the ecosystem, we will canvass the community to high grade which analytics should be prioritized. With target tables properly scoped, we will begin designing a parsing strategy to build base tables that will feed into the actual tables that will be built for analysis. Depending on the type of analysis, different types of databases will evaluated. Generally distributed columnar databases such as Redshift and BigQuery are optimized for this type of analysis but can be expensive to run in the long term. As such, we will initially focus on an OLTP DB like Postgres for a long term storage option using it until we find significant performance degradation. We are expecting to hit bottlenecks when dealing with live filtering of time series data and will have a better understanding of the best data warehousing solution once the volume of data can be better assessed after the denormalization and indexing.

After the tables are assembled and indexed, we will connect them to a business analytics tool set to display dashboards for analysis. Insight will be focusing on development of Superset but will provide various members of the community with direct access to the database from which other business analytics tools can be connected such as Tableau or Spotfire. For embedding charts to public consumption, Superset will be the best tool for the job.

After the pipelines are built with dashboards to display the initial metrics, we will actively recruit Insight data engineers to expand this analysis process to account for additional metrics requested by the community.

Links to Relevant Open Source Repositories:

insight-infrastructure/terragrunt-analytics
- Reference architecture to be used for deploying analytics infrastructure
insight-infrastructure/terraform-aws-superset
- Terraform module to spin up Superset
insight-infrastructure/terraform-aws-ec2-airflow
- Terraform module to spin up Airflow
insight-decentralized-consensus-lab/Unified-Blockchains
- Python toolkit for cross-chain data analysis; extracts multiple blockchain types to single standard SQL database

Core Team

Lead Engineer, Developer in Residence: Richard Mah, PhD

Richard is a developer in residence at Insight who worked on ICON infrastructure design, specifically in the development of advanced monitoring solutions for P-Rep nodes. He has deep experience in the intersection between advanced data analysis and high performance computing. During his PhD, he employed a wide variety of machine learning techniques to automate the analysis of clinical data with a microservices-based architecture on Kubernetes. Since then, he became an Insight fellow and has been focusing on building automated deployments of various blockchain related infrastructure components.

Github, Linkedin

Infrastructure Architect: Rob Cannon

Rob is Insight’s infrastructure architect at Insight who focuses primarily on blockchain and data analytics infrastructure. Many in the community might know him for his efforts in building automated deployments of infrastructure to be used across the ICON space from validator nodes to DApp testing platforms. Prior to Insight, he worked in the oil and gas industry founding a data science startup and working for a major operator. Rob has deep experience with all the tools described in this proposal which were core components in his last startup.

Github, Linkedin

Principal Investigator: Mitchell Krawiec-Thayer, PhD

Mitchell is Insight’s Head of Research for Developers in Residence, and is currently coordinating teams contributing open-source tooling and analyses for Polkadot, Zcash, Monero, and ICON. Mitchell founded the Monero Archival Project in 2017 and has been working with blockchain data sources and analyses ever since. Specializations include blockchain data engineering, data products, empirical analysis, on & off-chain anomaly detection, and blockchain data visualization.

GitHub, Twitter, LinkedIn, Medium

Aspected · June 21, 2020, 8:07pm

What kind of analytics in specific do you think will be possible with this project?

robcio · June 23, 2020, 10:29pm

Thank you for the question. An initial inspiration was measuring monthly active wallets (any wallet that has had at least one transaction during that period) to understand ICON’s on-chain activity. The idea expanded quickly after that, as we developed methods and metrics for diving deeper into the data. Once active wallets are identified, we can explore their behavior and analyze how ICON is used in the real world. We can ask questions such as:

How active are individual (and aggregate) wallets?
How is transaction activity distributed (1 wallet making 100 txns, or 100 wallets making 1 txn?)
How is account and transaction volume distributed? (1 wallet/txn with 100 ICX, or 100 wallets/txns with 1 ICX?)
Are the wallets weighted high to low volume?
Which contracts are active wallets interacting with the most?
Are there periodic trends in activity based on the time of day/week/month/year?
(and much more)

Analyses like these enable us to to classify wallets and learn what types of wallet and transaction activities are driving the network, which will help us understand the needs of ICON’s high-value users. To do this we would build queries and visualizations around the on-chain data, connected to filters (by time, amounts, activity, etc), in order to explore historical and real-time ICON activity.

This is just one example of the types of analytics we would be setting ourselves up to perform but to be clear, our most important goal for this project is to set ourselves up to perform all the analytics over time with a professional stack that will be simple to expand on by the community or others. (For example, at Insight we have a large contributor pool of data scientists and engineers who would easily be able to continue expanding the metric list and analysis stack based on further community requests)

Hope that answers your question and let us know if you have any suggestions! We appreciate any input from the community on additional areas of investigation.

Aspected · June 25, 2020, 9:28pm

Interesting ideas. I do think there is a potential problem by trying to analyze active wallets though:

A lot of ICONists are active HODLers which would suggest inaccurate data if you count wallets by monthly transactions. I think there’s also the problematic consequence of counting monthly transactions as spam occurs every now and then; Just a year ago we had a competition ran by IF and it spiked the daily transactions in the millions. I see that you guys might have some background expertise in Machine Learning which should be able to filter out spam, but I just wanted to mention it anyways.
Below are some potential ‘oppositions of the proposition’ that I could think of:

A whale holding 10m ICX in 20 wallets that he actively uses: Does that count as 50 users?
Spammers, bots of some kind outputting a very low number of transactions.
HODLers that haven’t done a single transaction the past 8 months but still actively are involved (How can they be included?)

Speaking of Machine Learning, do you plan on analyzing the data that is on the transaction itself (Eg: Looking for keywords such as ‘weather’), or are you guys more or less focusing purely on the transaction metrics?

I agree in general that this would be very useful as you said above with helping us understand what users are looking for. Since everything is transparent and the data can be read & analysed: There is a unique opportunity to utilize that. My hypothesis is that the Entertainment industry & DAO W/DeFi are the two most impactful areas that blockchain can be applied to, but it’s still just a hypothesis and data analysis is objective.

I’m very hopeful to see you guys getting this project funded. Good luck.

+Support

Ben_Lee · June 26, 2020, 5:53am

Initial Review Result Comments

Review Result

Approve

Review Comments

We are excited to see the analysis on our blockchain that you can provide and hope to gather many insights from this service.

Next Procedure

ICON Foundation provided the Initial Review Result Comments on the proposal. Grants recipients will respond to the Foundation’s response. Depending on your comments, the foundation will decide the amount of grants to pay and provide the Final Review Result Comments.

mitchellpkt · June 28, 2020, 7:43pm

I absolutely agree, @aspected. Your scenarios are perfect examples of why wallet counts are not the same as user counts, and likewise why transaction volume is not a linear indicator of adoption. We can still derive ensemble insights from aggregate behavior (fictional example: “the data show a trimodal distribution of transaction frequency - inactive wallets of investors, day-to-day users, and high-frequency minute-to-minute wallets”). However, it’s crucial to recognize that the ratio of wallet classes does not tell us the ratio of user types, since there isn’t a 1-to-1 relationship between individuals and wallets. I tend to be rather pedantic about rigor in statistical generalizations. :- )

Yes, our analytics stack will parse the data field so that it can be queried, including substring matches. This should provide the data source foundation for a variety of text analysis and NLP studies. I could imagine a few interesting extensions:

Generating word clouds from ICON transactions (with a filter for technical terms)
Cluster data field contents by (Levenshtein?) string edit distance to identify common data payload patterns and their inter/intra-variations
Plot evolution of these signatures over time (e.g. convert word cloud to bar chart race)

Always appreciate your input, let me know if you have any more ideas!

-:- Mitchell @ Insight

mitchellpkt · July 1, 2020, 3:09pm

Thank you @Ben_Lee and the ICON Foundation for supporting this project! We’re beginning development this week, and will poll the community shortly about high-priority analytics.

We really appreciate this opportunity to contribute, and look forward to further discussion and feedback.

Best,
Mitchell, Rob, and Richard

ICON_ADMIN · July 3, 2020, 11:53am

Final Review Result Comments

Review Result

Approved

Reporting Period

Grant recipients should produce Periodic Reports in order to continue to receive grant funding. If the Periodic Report is showing that the initiative is not meeting expectations, the ICON Foundation has the option to halt continued funding. Please check the link (ICON Community Grant Program (“ICON CGP”)) and leave reports on this page (https://forum2.icon.community/c/g/pr)

The team must submit the Periodic Report by the timeframes below (UTC):

2020/07/22 02:00 AM (UTC)
2020/08/20 02:00 AM (UTC)
2020/09/20 02:00 AM (UTC)

Funding Amount

The ICON Foundation is going to give out $20,000 in 3 monthly installments based on the Periodic Report. The amount of ICX will be determined and paid according to the price at 02:00 AM (UTC) on the day of payment.

2020/07/24 02:00 AM (UTC) - $6,666
2020/08/24 02:00 AM (UTC) - $6,666
2020/09/24 02:00 AM (UTC) - $6,667