Network disruption and discussion on ways to mitigate

On Thursday, July 9th, at approximately 11:15 AM PDT (18:15 UTC), there was a major network disruption that threw a number of sub P-Reps into block sync. Later, at 2:25 PM PDT (21:25 UTC), there was a larger disruption which took out the main ctz.solidwallet.io endpoint (which P-Reps rely on for syncing). Following that outage, which lasted for 1.5 hours, all nodes began to sync from the endpoint at once and apparently overloaded it (to be verified by the Foundation). In total, it took 48 hours from the start of the initial incident for all the nodes to sync back up.

I want to take this opportunity to highlight that we need to be discussing how to mitigate this problem in the future. While I can’t speak to every blockchain ecosystem out there, from what I gather, having a centralized API endpoint that numerous parties rely on (e.g. ctz.solidwallet.io) is a widespread problem. On larger chains like Ethereum, this kind of endpoint is offered via a paid service called Infura. If Infura goes down, so do large swaths of Ethereum dapps. ICON’s current infrastructure is precariously centralized on a single public endpoint that anyone can attack. It’s a natural target, and once disabled, the effects cascade to impact the actual health of the chain.

There are a number of ways to mitigate this, but getting down to brass tacks, I think something needs to be done in the near term, which I’ll get to later. In the long term, we can remove the dependence on this endpoint by offering a token-based API-as-a-Service (similar to Infura) that tracks API requests and allows dapps to run without needing to manage complex infrastructure. Several teams have built their own load-balanced endpoints, but again, we can’t rely on public nodes, and these nodes need to be able to autoscale in adverse conditions (whether an attack, a spike in network load, or an accidental overload).
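To make the "token-based, tracks API requests" idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `RequestTracker` class, the token names, and the fixed per-token quota are hypothetical and are not part of any existing ICON or Infura API.

```python
from collections import defaultdict


class RequestTracker:
    """Hypothetical per-token request counter with a fixed quota."""

    def __init__(self, quotas):
        # quotas maps an API token to the number of requests it may make.
        self.quotas = dict(quotas)
        self.counts = defaultdict(int)

    def record(self, token: str) -> bool:
        """Count one request. Return False for unknown or over-quota tokens."""
        quota = self.quotas.get(token)
        if quota is None or self.counts[token] >= quota:
            return False
        self.counts[token] += 1
        return True


# Example: a dapp with a quota of two requests.
tracker = RequestTracker({"dapp-a": 2})
results = [tracker.record("dapp-a") for _ in range(3)]
```

A real service would also need time windows, persistence, and billing, but the core bookkeeping is roughly this small.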

In the near term, we’ll be putting together a proposal to adapt work we did under a grant from the Web3 Foundation for the Polkadot network, building geo-routed, multi-provider, autoscaling, load-balanced endpoints. Lots of jargon in there, but basically it means we built some of the baseline functionality of Infura and automated the deployment on the four largest cloud providers. It’s all open source, so the community can use it freely, but I strongly believe that we need to deploy a private cluster that is restricted to the P-Reps’ IP addresses to prevent a repeat of yesterday’s incident. With this, the critical syncing endpoint would be protected from attacks and could be easily scaled in the event of a network disruption. After building and running this cluster to secure the ICON backbone immediately, we can then talk about building the long-term solution, which would have a token-based authentication method in front that all the DApps can use.
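As a rough illustration of what "restricted to the P-Reps’ IP addresses" means in practice, the check below uses Python's standard `ipaddress` module. The addresses are placeholders from the documentation ranges, not real node IPs, and `PREP_ALLOWLIST` and `is_allowed` are hypothetical names. In a real deployment this rule would more likely live in a cloud firewall or security group than in application code.

```python
import ipaddress

# Hypothetical allowlist of registered P-Rep addresses (placeholder ranges).
PREP_ALLOWLIST = [
    ipaddress.ip_network("203.0.113.10/32"),
    ipaddress.ip_network("198.51.100.0/28"),
]


def is_allowed(client_ip: str, allowlist=PREP_ALLOWLIST) -> bool:
    """Return True if client_ip falls inside any allowlisted network."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        # Malformed addresses are rejected rather than raising.
        return False
    return any(addr in net for net in allowlist)
```

Requests from anything outside the list would simply be dropped before they reach the syncing nodes.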

Would love any thoughts on this before we draft a proposal. Not trying to be alarmist when raising these concerns but I think everyone knows what is at stake here.

We fully support the notion Rob mentioned and would love to see a proposal to improve the situation. To put things into perspective, at the peak of the network disruption 49 out of 100 nodes were down and struggling to block sync with the network (2 out of 22 main P-Reps and 47 out of 78 sub P-Reps).

Hey Rob, I am not capable of much right now, but from skimming this I feel like it is more about keeping nodes in sync. Our bot node lagged a bit in that time frame, but luckily we didn’t fall that far behind. Anyway, my point was about local nodes for applications: since the network and the applications both require them, I feel like what you are describing is making these nodes stronger. I’d like to talk through the details when we can find a common time.

Great post, thanks! Can someone skilled please explain to us if there is any crossover with what Pocket network are aiming to achieve and what you’ve advised here?

Hey @Emre, the stats were lifted from a Grafana dashboard that has all the blockheight info for all the nodes in a time series. We’ll be making that public soon so that node operators get a detailed historical view of their sync status. Helps to see the rate of syncing in these situations to figure out when nodes will be coming back online or if they are truly stuck and just chasing blocks.
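For anyone who wants to do the same estimate by hand from block height samples, here is a minimal sketch. It is not the dashboard's actual logic. The function name and the sample format (time, node height, network height) are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def estimate_catch_up_seconds(
    samples: List[Tuple[float, int, int]],
) -> Optional[float]:
    """Estimate seconds until a node catches up with the network.

    samples holds (unix_time, node_height, network_height), oldest first.
    Returns None when the gap is not closing (stuck or just chasing blocks).
    """
    if len(samples) < 2:
        return None
    t0, node0, net0 = samples[0]
    t1, node1, net1 = samples[-1]
    elapsed = t1 - t0
    if elapsed <= 0:
        return None
    gap_start = net0 - node0
    gap_end = net1 - node1
    if gap_end <= 0:
        return 0.0  # already in sync
    # Blocks per second by which the node is closing the gap.
    closing_rate = (gap_start - gap_end) / elapsed
    if closing_rate <= 0:
        return None
    return gap_end / closing_rate
```

For example, a node that was 1000 blocks behind and is 700 behind ten minutes later is closing 0.5 blocks per second, so it needs roughly another 1400 seconds. A node whose gap stays at 1000 is only chasing blocks and gets `None`.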

Happy to chat about solutions as well. My main concern with all of this is whether these nodes can scale fast enough. If nodes take 2 hours to scale, that won’t help us much if we get attacked. Otherwise we have to overprovision clusters, which will be expensive. I’ll put details of this into the proposal.

@NorskKiwi From my understanding, they are working mostly at the protocol level, not on infrastructure. I’m not an expert (perhaps someone from their team can weigh in), but I think they are trying to incentivize people to run citizen nodes, with operators getting paid in their token. It assumes you know how to run the node. One of the upsides of using their network for this is that they have a token authentication layer to access the API, which makes it harder to attack, but it still has vulnerabilities that a private endpoint would not have.

My main point with this is that we should be thinking about what we need in the future vs now. If there is widespread adoption of the pocket/figment coin then this could be a viable solution if enough nodes join the network (still would need to research and know more). What we would be doing though is purely on the infrastructure layer providing both a managed endpoint alongside a simple to use solution to spin up and manage these nodes in production. If we have enough people running and using citizen nodes, it makes sense to have a protocol layer like pocket/figment’s or some other type of layer to compensate node operations like Infura which prices everything in fiat.

I understand your aims. That’s important, and it would probably help us reduce the cost of feeding our products. I’d really like to talk through the details, but I learned my situation is much worse than I was expecting. According to the report, my operation had several complications and took twice as long as it normally should. I feel OK, but it’s cutting me out of a lot of stuff. I think by the time I’m back you will have already produced your solution, and the only thing left for us to do will be implementing it :)

Hey @Emre, I think applications need their own endpoint in the long run. For now, what we would be proposing would focus solely on the P-Reps, with firewalls restricting access to registered nodes. Long term, though, we’d be looking to support application developers, as it is the same thing with just different firewall settings. Will discuss more in the next post.

So I just wanted to update everyone paying attention to this that I had a great discussion with Michael O’Rourke from Pokt network. Definitely a lot of ways for us to work together. Copying him to this thread so he can add to the conversation but from what we discussed, Pokt network seems like a very good option for ICON API nodes. We have all been asking what we can do with the idle resources for the sub-preps and Pokt might have the right solution to both discover these nodes and route traffic down to them while rewarding those node operators for serving API requests.

What did come up in discussions, though, is that while Pokt looks like a fantastic long-term solution, if we want to address this problem in the near term it would be prudent to get a private endpoint up. If we were to build it at Insight, it would all deploy automatically, and the work would serve as a starting point for others to easily run nodes, whether Pokt-based or otherwise.

Going to leave details on the execution for the actual proposal but just wanted to get any last feedback before we draft it in the next few days.

On your point, Pocket’s solution is a good and easy option for developers who have no node experience or don’t want the hassle of load scaling. Personally, I always prefer a local node for performance and many other reasons. The single point of failure is the problem for us. Rhizome has also deployed several clusters and put a load balancer in front of them. We will give that a try soon to measure performance. Lastly, I agree there are some spare resources on the network in the sub P-Rep nodes, and from our monitoring a high percentage of them don’t look stable.

Thanks for pointing me here Rob.

To follow up on this thread (apologies on the tardiness), it makes a lot of sense for Insight and potentially with some other provider’s assistance to create some endpoints ASAP for redundancy purposes to ensure this issue doesn’t happen again in the near term.

After speaking with Rob I wholeheartedly agree that there is potential for some long term collaboration to help ICON decentralize the problem and create an additional incentive for sub-preps.

Correct me if I’m wrong, but having sub-preps run a Pocket Core node and access a decentralized endpoint via Pocket would help solve the issue at hand and provide an additional revenue stream for sub-preps.

We launched our mainnet a week ago with Ethereum and native Pocket support. We expect to be adding new chains within the next 30 to 60 days, and would be happy to assist in a proposal that would add ICON as a supported chain within Pocket.

Hi @o_rourke.

Thanks for your insights and agree with everything there and would love to support. The sub-preps should have an option of running Pocket Core and looking forward to making that happen.

Got super backed up last few days but will get that first draft proposal out soon. Thanks for your insights and talk soon.