On Thursday, July 9th, at approximately 11:15 AM PDT (18:15 UTC), there was a major network disruption that threw a number of sub P-Reps into block sync. Later, at 2:25 PM PDT (21:25 UTC), there was a larger disruption which took out the main ctz.solidwallet.io endpoint (which P-Reps rely on for syncing). Following that outage, which lasted for 1.5 hours, all nodes began to sync from the endpoint at once and apparently overloaded it (to be verified by the Foundation). In total, it took 48 hours from the start of the initial incident for all the nodes to sync back up.
I want to take this opportunity to highlight that we need to be discussing how to mitigate this problem in the future. While I can’t speak to every blockchain ecosystem out there, from what I gather, having a centralized API endpoint that numerous parties rely on (e.g. ctz.solidwallet.io) is a widespread problem. On larger chains like Ethereum, this kind of endpoint is offered through a paid service called Infura. If Infura goes down, so do large swaths of Ethereum dapps. ICON’s current infrastructure is precariously centralized on a single public endpoint that anyone can attack. It’s a natural target, and once it is disabled, the effects cascade and impact the actual health of the chain.
There are a number of ways to mitigate this, but to get down to brass tacks, I think something needs to be done in the near term, which I’ll get to later. In the long term, we can remove the dependence on this endpoint by offering a token-based API-as-a-Service (similar to Infura) that tracks API requests and allows dapps to run without needing to manage complex infrastructure. Several teams have built their own load-balanced endpoints, but again, we can’t rely on public nodes, and these nodes need to be able to autoscale in adverse conditions (whether an attack, a spike in network load, or an accidental overload).
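To make the token-based idea a bit more concrete, here is a minimal sketch of per-token request tracking. Everything in it is an assumption for illustration: the `ApiKeyMeter` class, the token names, and the fixed quotas are hypothetical and not part of any existing ICON or Infura API. A real service would also need persistent storage, quota resets per billing period, and proper authentication.

```python
from collections import defaultdict


class ApiKeyMeter:
    """Hypothetical per-token request counter with a fixed quota per token."""

    def __init__(self, quotas):
        # quotas maps an API token to the number of requests it may make.
        self.quotas = dict(quotas)
        self.used = defaultdict(int)

    def authorize(self, token):
        """Return True and count the request if the token is known and under quota."""
        quota = self.quotas.get(token)
        if quota is None:
            return False  # unknown token: reject without tracking it
        if self.used[token] >= quota:
            return False  # quota exhausted
        self.used[token] += 1
        return True


# Example with a made-up token that is allowed two requests.
meter = ApiKeyMeter({"dapp-a": 2})
results = [meter.authorize("dapp-a") for _ in range(3)]
```

The point of the sketch is only that requests are attributed to a token before they reach a node, which is what lets the service track usage and shed load per client instead of per IP.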
In the near term, we’ll be putting together a proposal to adapt work we did under a grant from the Web3 Foundation for the Polkadot network, building geo-routed, multi-provider, autoscaling load-balanced endpoints. Lots of jargon in there, but basically it means we built some of the baseline functionality of Infura and automated the deployment on the four largest cloud providers. It’s all open source, so the community can use it freely, but I strongly believe that we need to deploy a private cluster that is restricted to the P-Reps’ IP addresses to prevent a repeat of yesterday’s incident. With this, the critical syncing endpoint would be protected from attacks and could be easily scaled in the event of a network disruption. Once this cluster is built and running, which secures the ICON backbone immediately, we can then talk about building the long-term solution, which would put a token-based authentication method in front that all the dapps can use.
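As a rough illustration of what "restricted to the P-Reps’ IP addresses" means in practice, here is a minimal allowlist check using Python's standard `ipaddress` module. The addresses are placeholders from the documentation ranges, and `ALLOWED_NETWORKS` and `is_allowed` are hypothetical names. In a real deployment this filtering would more likely live in the load balancer or firewall rules than in application code.

```python
import ipaddress

# Placeholder allowlist (documentation address ranges, not real P-Rep nodes).
ALLOWED_NETWORKS = [
    ipaddress.ip_network("203.0.113.10/32"),   # a single node address
    ipaddress.ip_network("198.51.100.0/28"),   # a small block: .0 through .15
]


def is_allowed(client_ip: str) -> bool:
    """Return True if client_ip parses and falls inside an allowlisted network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # malformed input is treated as not allowed
    return any(addr in net for net in ALLOWED_NETWORKS)
```

Anything that fails the check never reaches the syncing nodes, so a flood from outside the P-Rep set can’t overload them the way a public endpoint can be overloaded.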
Would love any thoughts on this before we draft a proposal. I’m not trying to be alarmist in raising these concerns, but I think everyone knows what is at stake here.