Blockchain Analytics and Data Pipelines - Progress Report #3

Report Name

Blockchain Analytics and Data Pipelines - Report 3

Reporting Period

August 21 - September 19 (30 days)

Report Description

The ICON-ETL package has had its first full release (v0.1.0) and is available on PyPI! Integration with Airflow is underway, with DAGs working to extract the blockchain data and load it into a Postgres database. The bug introduced by the recent change to the ICON RPC Server (part of the 1.5.0 release) has been fixed in the 1.5.1 release; however, it delayed our work on this project by nearly a month. Basic queries on the data are functioning, with visualizations coming in the next week or so.

Project Completion Percentage

80%

Remaining Time to Completion

3.5 weeks

Expected Results for the Next Period

The next reporting period should see the final deliverables for the project, which was delayed as mentioned above. We also have two fellows this session building analytics solutions on top of our pipeline; work from their projects will be integrated when their fellowship ends in two months.

Materials Proving Progress on the Project

Review of each KPI (Key Performance Indicator) or specific goal/milestone

Phase 1: Metrics High Grading (80%)

Milestones:

  • Canvass community and foundation for a high grading of metrics to be collected (completed)

Deliverables:

  • Consolidated list of metrics and associated tables needed to feed analytics (in progress)
  • DDL for initial SQL schema design and SQLAlchemy object model (completed; see the sketch below)

Current progress update:

  • Final bugs are being worked out with Airflow, so the database should be ready shortly for community analytics use.
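
As a rough illustration of the object model behind the DDL deliverable, the sketch below shows a minimal SQLAlchemy declarative mapping for a blocks table. The table and column names are illustrative placeholders, not the actual schema.

    # Minimal sketch of a SQLAlchemy object model for chain data.
    # Table and column names are illustrative placeholders, not the real schema.
    from sqlalchemy import BigInteger, Column, Integer, String, create_engine
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    class Block(Base):
        __tablename__ = "blocks"  # hypothetical table name

        number = Column(BigInteger, primary_key=True)   # block height
        hash = Column(String(66), unique=True)          # block hash
        timestamp = Column(BigInteger, index=True)      # assumed microsecond epoch time
        transaction_count = Column(Integer)

    # Emitting the DDL against a Postgres instance (connection string is a placeholder):
    # engine = create_engine("postgresql+psycopg2://user:pass@localhost/icon")
    # Base.metadata.create_all(engine)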

Phase 2: Infrastructure Deployment (80%)

Milestones:

  • Get infrastructure up in a pattern that can support multiple environments (nearly complete)
  • Build high throughput architecture for delivering analytics (in progress)
  • Selection of long term storage options and short term query optimized solutions (in progress)

Deliverables:

  • Terraform and Ansible to stand up Airflow, workers, OLTP DB, OLAP DB, and business intelligence dashboarding tools with automation

Current progress update:

  • IaC repos for Airflow and the databases are essentially complete, with some final bug fixes still in progress.

Phase 3: Data Pipelines (75%)

Milestones:

  • Chain parsers and data pipelines feeding intermediary tables and data warehouse (in progress)
  • Database tuning and index optimization for high fidelity exploratory queries (forthcoming)

Deliverables:

  • A collection of Airflow DAGs to construct data pipelines (nearly complete)
  • Scheduled jobs to build reports and analysis tables (forthcoming)

Current progress update:

  • DAGs to export and load blockchain data are essentially complete, with final bug fixes pending (a minimal sketch of the pattern is shown below).
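
For context on what these DAGs look like, here is a minimal sketch of the export-then-load pattern, assuming Airflow 2.x. The iconetl command line and flags, file paths, schedule, and database URI are illustrative assumptions rather than the production configuration.

    # Minimal sketch of the export-then-load DAG pattern (Airflow 2.x assumed).
    # The iconetl flags, paths, schedule, and DB URI are placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="icon_export_load_example",
        start_date=datetime(2021, 8, 1),
        schedule_interval="@hourly",
        catchup=False,
    ) as dag:
        # Step 1: export a block range to CSV with the icon-etl tool
        # (consult the icon-etl docs for the exact command and flags).
        export_blocks = BashOperator(
            task_id="export_blocks",
            bash_command=(
                "iconetl export_blocks_and_transactions "
                "--start-block {{ params.start }} --end-block {{ params.end }} "
                "--blocks-output /tmp/blocks.csv "
                "--transactions-output /tmp/transactions.csv"
            ),
            params={"start": 0, "end": 999},
        )

        # Step 2: load the exported CSV into Postgres, e.g. via COPY.
        load_blocks = BashOperator(
            task_id="load_blocks",
            bash_command=(
                "psql $ICON_DB_URI -c "
                "\"\\copy blocks FROM '/tmp/blocks.csv' CSV HEADER\""
            ),
        )

        export_blocks >> load_blocks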

Phase 4: Data Visualization (50%)

Milestones:

  • Build dashboards to support high-graded metrics from initial community canvassing (forthcoming)

Deliverables:

  • Deployment of Superset business analytics tooling with support for integration with Tableau (complete)
  • Visualizations that can be wrapped in iframes and embedded into various block explorers and other tools the community uses to visualize metrics (forthcoming)

Current progress update:

  • Superset has been deployed and is connected to the dev database, with basic visualizations in place (an example of the kind of query behind them is sketched below).
  • Work has started on developing more useful visualizations.
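
As an example of the kind of basic query behind those first charts, the sketch below computes daily transaction counts with SQLAlchemy and pandas, assuming timestamps are stored as microseconds since the epoch. The table name, column names, and connection string are placeholders; in Superset the same SQL would simply back a chart directly.

    # Sketch of a basic query behind an initial visualization: daily tx counts.
    # Table/column names and the connection string are placeholders.
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql+psycopg2://user:pass@dev-db/icon")  # placeholder URI

    daily_tx = pd.read_sql(
        text(
            """
            SELECT date_trunc('day', to_timestamp(block_timestamp / 1e6)) AS day,
                   count(*) AS tx_count
            FROM transactions
            GROUP BY 1
            ORDER BY 1
            """
        ),
        engine,
    )
    print(daily_tx.tail())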

Review Result Comments

Review Result

Approve

Review Comments

We’ve gone through your overall project and came up with the requests below to make it more useful. Please check these comments and let me know if you have any questions.

  • Please send a pull request and merge your project into https://github.com/blockchain-etl
  • Please provide a Docker-based environment
  • Please develop a DB connector to insert data directly into MySQL, MongoDB, Postgres, etc.
  • Could you explain the improvements compared to other protocols’ ETL projects?
  • Add more description to the Readme so that developers can use this more easily
  • Add more description regarding exporting method
  • Add more description regarding DB schema

Hi Bong,

Sorry for the delayed response, but I wanted to give you a full response to each of these items, which I’ve left inline below.

  • Please send a pull request and merge your project into https://github.com/blockchain-etl

Sounds good. Having the repo over there would definitely gain ICON some exposure. The repo will need maintenance over time, though, so we’ll touch base with the maintainers to discuss how to manage that.

  • Please provide a Docker-based environment

Can you clarify what should be dockerized, the icon-etl tool or Airflow? We’d likely have follow-up questions from there, as our production stack is VM-based and does not use Docker.

  • Please develop a DB connector to insert data directly into MySQL, MongoDB, Postgres, etc.

For DB connectors, just to clarify, are you talking about the icon-etl tool itself or the ability to push the data into MySQL, MongoDB, etc.? For the icon-etl tool, the normal outputs are simple CSV or JSON files which are dumped to S3; this is how the blockchain-etl ecosystem tools work. From there, the data is loaded into Postgres. Supporting MySQL or MongoDB is a matter of building an Airflow DAG to push the data to those backends. The blockchain-etl ecosystem is built around Postgres for its OLTP backend and BigQuery / Redshift for OLAP, which was the main goal of this grant.
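
To make that concrete, a loader for another backend could be as small as the hypothetical task below, which reads an exported CSV and pushes it into MySQL. The paths, table name, and connection string are illustrative, not part of the current pipelines.

    # Hypothetical Airflow task body: push an icon-etl CSV export into MySQL.
    # Paths, table name, and connection string are illustrative placeholders.
    import pandas as pd
    from sqlalchemy import create_engine

    def load_blocks_to_mysql():
        # Read the exported CSV (assumed synced locally from S3; with s3fs
        # installed, pandas can also read "s3://bucket/blocks.csv" directly).
        blocks = pd.read_csv("/tmp/blocks.csv")
        engine = create_engine("mysql+pymysql://user:pass@mysql-host/icon")
        blocks.to_sql("blocks", engine, if_exists="append", index=False)

Wrapping that function in a PythonOperator on a schedule gives the DAG; a Mongo loader would follow the same shape with pymongo in place of SQLAlchemy.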

  • Could you explain the improvements compared to other protocols’ ETL projects?

We used the base blockchain-etl framework as a starting point and tailored it to properly export data from the ICON network. While no additional features were added, the existing packages cannot simply be pointed at an ICON endpoint and start exporting data; specific customizations were needed to correctly interface with the network and export the data.

  • Add more description to the Readme so that developers can use this more easily

In the Readme there is a link to the updated docs.

  • Add more description regarding exporting method

We have additional details in the docs.

  • Add more description regarding DB schema

We’ve included docs for the schema here.

Thanks again for the opportunity to work on this project. Let us know your thoughts on items #2 and #3, as well as ways to carry this forward and get the most use out of these analytics tools. We are hosting the complete DB right now and have given access to some members of the community. We’ve also been responsive to requests from @BennyOptions_LL and built some dashboards to demonstrate the process.