[Network proposal] Secure the network by degrading the Offline nodes

Hello everyone,

This is a continuation of a topic that was discussed before by several teams and recently brought to our attention by the ICON4Education P-Rep team.

We believe that running the node should be a minimum requirement to participate in the network. While the on-chain solution is in place for the main P-Reps, there is no instrument to govern the Sub P-Reps. For this reason, we propose the following solution, to be implemented at the protocol level, to improve network security by removing dead nodes from the Sub P-Rep position until they are restored (a rough code sketch of the check follows the list below):

  • I-Score is distributed every ~24h, or 43,200 blocks

  • once a day, let’s say at daily block 21,600, all node statuses are read

  • If a node’s status is Blocksync or Offline, the latest block height is read from the node

  • If the node’s latest synced block is 30 days × 43,200 blocks/day = 1,296,000 blocks behind the current block, the node stops being a Sub P-Rep and is demoted to Candidate status at the next I-Score epoch. We propose a 30-day offline period before a node is demoted and stops generating the node reward.

  • If the node status is 0x0, the node is demoted to Candidate status at the next I-Score epoch.
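To make the proposed check concrete, here is a minimal sketch in Python of the demotion rule described in the list above. The function name, the status strings, and where the check would actually run are assumptions for illustration only; the real implementation would sit at the protocol level.

```python
# Minimal sketch of the proposed daily check (illustration only).
# Constants come from the proposal above; names and status labels are assumed.

BLOCKS_PER_DAY = 43_200                       # ~24h of 2-second blocks
OFFLINE_GRACE_BLOCKS = 30 * BLOCKS_PER_DAY    # 30 days x 43,200 = 1,296,000 blocks

def should_demote(node_status: str, node_block_height: int, current_block: int) -> bool:
    """Return True if a Sub P-Rep should drop to Candidate at the next I-Score epoch."""
    if node_status == "0x0":                  # node reports no valid status at all
        return True
    if node_status in ("Blocksync", "Offline"):
        lag = current_block - node_block_height
        return lag >= OFFLINE_GRACE_BLOCKS    # roughly 30 days behind the chain tip
    return False                              # healthy node keeps its Sub P-Rep slot
```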

If you think that running the node should be a minimum requirement to participate in the network, please support the proposal. If you think it shouldn’t be required, please present your reason.

If there is a better way to implement this on-chain, or we missed something, we look forward to hearing it.

And again, all credit to the teams that have raised this subject many times before.


Approved.

We love your solution!


Here is a reply to the question raised by @espanicon about the concern that this network proposal could potentially cause a network bottleneck. I copied the question and the reply here, as it may be useful for the proposal discussion.

Multiple monitors and trackers already query the nodes every minute or less through API calls. A similar procedure to remove main P-Reps when they miss blocks is already in place, as is the function to change node status from Delegate to Sub P-Rep. In our proposal, the time period is so long - once every 24h - that a protocol of ICON’s size should not even notice it.
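As a rough illustration of the kind of lightweight polling the monitors and trackers already do, the sketch below asks a node for its latest block over the ICON JSON-RPC API (icx_getLastBlock) and compares it to a reference endpoint. The URLs, the helper name, and the exact field layout of the response are assumptions for this example, not part of the proposal.

```python
import requests

def get_last_block_height(rpc_url: str) -> int:
    """Ask a node for its latest block via the icx_getLastBlock JSON-RPC call."""
    payload = {"jsonrpc": "2.0", "method": "icx_getLastBlock", "id": 1}
    response = requests.post(rpc_url, json=payload, timeout=5)
    response.raise_for_status()
    # Field name assumed from the public JSON-RPC block format.
    return response.json()["result"]["height"]

# Example: compare a Sub P-Rep node against a reference endpoint (URLs are placeholders).
reference_height = get_last_block_height("https://ctz.solidwallet.io/api/v3")
node_height = get_last_block_height("https://example-subprep-node/api/v3")
print(f"node is {reference_height - node_height} blocks behind the reference")
```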

Our intent was not to provide a bigger and more complex solution to the offline-nodes problem; rather, we are looking to provide an easy, simple, and practical one. That is where the real strength of our proposal lies: in its simplicity, sorting the problem with a few lines of code.

20 days and pending… @ICON_ADMIN

Thank you for taking the time to put this together. I agree it’s not ideal to be using network resources to pay for nodes that are not even online. We can begin to talk about solutions, but we should also talk about priorities.

Many realistic solutions to this require changes to the consensus mechanism. Right now I would say that implementing LFT2 on the public chain is a greater priority than developing solutions for this problem on the current consensus mechanism. Any solution to this in the short term would also pull resources away from IISS 3.0 development. I agree it’s a problem and should be solved, but timing-wise I don’t think now is the time.

And to respond to your solution specifically, I would not agree with it in its current design, unless I am misunderstanding something. It appears you propose that all node statuses be checked once per term. To read node statuses, there are two options I can think of; let me know if you have another idea for a technical architecture that would make this more realistic:

1.) Have all 100 nodes report their status/block height to all Main P-Reps. This would take time and slow down the network once per day, opening up a clear and predictable attack vector.

2.) Have all Main P-Reps send a request to an API from the Status Monitor (or another service) and await a response. This is faster than option 1 (though still creates a clear and predictable attack vector) but poses another issue: Our blockchain would be relying on an API response in order to continue functioning. If some nodes get a response from the API at slightly different times, they could get different responses and thus be unable to form consensus, which stops block production. Block production would also stop if the API went down.

You also propose a series of checks that every Main P-Rep must do prior to forming consensus. This adds additional calculation and would slow down our network. Remember that a block needs to be produced every 2 seconds.

I appreciate your effort, but when examining the technical architecture of this solution in more detail it does not appear realistic. I have gone through similar solutions in the past and arrived at the same conclusion.

My current top choice to solve this problem is doing what EOS does -> there is one block producing slot for sub p-reps. Sub P-Reps rotate into this slot. The specifics of selecting which sub p-rep comes into rotation have not been discussed. Before doing this it would make sense to try to lower the cost of the minimum specifications, given that all sub p-reps would then need to run infrastructure that could handle producing blocks for 1 term. Another option is to just raise the number of Main P-Reps to 100 over time, then this won’t even be an issue.


Hey Scott @BennyOptions_LL ,

Thank you for the reply. I am glad to hear that the fact that offline nodes take away resources is noted and acknowledged. The priorities you mentioned make sense, and that is a good thing, because it gives us time to come up with a good, practical solution while more important things are in development.

I am not a fan of solution 1, for the same reasons you described. I don’t think such a message-heavy approach is necessary in a short time frame, for a simple reason: the original solution’s time frame is 1 day / 30 days. The long period should eliminate network congestion, and there is no need for consensus at the block-time level.

I would like to hear about and explore the possibilities of the second scenario. Again, the long period needs to be taken into account; over such a long period, high-confidence values can be achieved. One option could be the introduction of cumulative daily flags. If, for example, a node accumulates 30 flags, then we can talk about the 22 main P-Reps checking that single sub P-Rep’s status. Rather than cross-checking everyone at the same time all the time, this should be a lighter way to do it. Final consensus could be achieved either on-chain after the main P-Rep checks, or, as a lighter option, by launching a Network Proposal.

I believe the unresponsive-API problem is not a big one, as it is very unlikely that the API would stay unresponsive for the full 30-day period. This is based on the scenario where a successful Online status response clears all previously accumulated flags on the node.
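A rough sketch of the cumulative daily-flag idea, assuming a simple in-memory counter for illustration; the threshold, the names, and where the flags would actually be stored (on-chain state or elsewhere) are exactly the open questions.

```python
# Sketch of the cumulative daily-flag idea (illustration only; names are assumed).

FLAG_THRESHOLD = 30   # ~30 failed daily checks before the 22 Main P-Reps verify the node

flags = {}            # node address -> accumulated daily flags

def record_daily_check(node_address: str, node_is_online: bool) -> bool:
    """Update a node's flag count after the once-per-day status check.

    Returns True when the node has accumulated enough flags that the
    Main P-Reps should cross-check it (and potentially demote it).
    """
    if node_is_online:
        flags[node_address] = 0            # a single healthy response clears all flags
        return False
    flags[node_address] = flags.get(node_address, 0) + 1
    return flags[node_address] >= FLAG_THRESHOLD
```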

I don’t think the EOS solution is possible at this stage. With the recent reward reductions, the requirement to run a full-spec node would most likely leave everyone with less than ~1m ICX in votes running at a loss. Losing those nodes is not in anyone’s interest.

There is one more thing I am unsure about, and if you could elaborate in more detail I would be grateful. I mentioned in the proposal an option to build the solution around the current process by which a node changes its status between Delegate and Sub P-Rep. How does it work? For newly registered nodes, if they have enough votes to enter the top 100, it takes 2 days for the status to change. How is consensus achieved each term about who is eligible to enter the top 100, changing the status from Delegate to Sub P-Rep and the reverse?

Just to confirm, because I wasn’t exactly sure what you meant by option 2, I will be responding to the below:

2.) Have all Main P-Reps send a request to an API from the Status Monitor (or another service) and await a response. This is faster than option 1 (though still creates a clear and predictable attack vector) but poses another issue: Our blockchain would be relying on an API response in order to continue functioning. If some nodes get a response from the API at slightly different times, they could get different responses and thus be unable to form consensus, which stops block production. Block production would also stop if the API went down.

This is not a good solution because it relies on an API to reach consensus. 67% of Main P-Reps must have the exact same copy of the blockchain in order for consensus to be reached. Everything we are discussing here must be included in blocks, meaning that all nodes must agree on it at the same time. If it’s not included in blocks, then it’s an off-chain solution (backend infra running on AWS, for example), where we would need to manually punish nodes that don’t comply and would have some sort of central point of failure for this punishment system.

What I’m hoping to clarify here is why a non-responsive API could bring down the chain very easily. P-Rep 1, P-Rep 2… and P-Rep 22 must all agree on the status of nodes every block, or at whatever interval you propose; it doesn’t really matter (maybe you’ll propose checking status every 43,200 blocks).

No matter what the time frame, everybody must agree at the same time. So P-Rep 1 might get a different response from the API than P-Rep 22, given that things can change from request time to response time. Additionally, if the API were to go down at any time, there would be a problem reaching consensus. There are many issues with this solution regarding reaching consensus. If you want to think more about it, make sure that your proposal has a foolproof way to ensure that all Main P-Reps reliably receive the exact same information.
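To make the determinism point concrete, here is a small, purely illustrative simulation (not ICON code; all names are made up): two validators apply the same state-transition rule, but the rule reads an external status source that changes between their reads, so their resulting states diverge and they cannot agree on the block.

```python
import hashlib

# Illustration only: a rule that reads a mutable external source breaks consensus,
# while a rule that uses only in-block data would not.

external_api_state = {"node_A": "Online"}    # stand-in for an external status API

def read_external_status(node: str) -> str:
    return external_api_state[node]

def apply_block(prev_state: str, block_data: str, node_status: str) -> str:
    """Deterministic state transition *except* for the injected node_status input."""
    return hashlib.sha256(f"{prev_state}|{block_data}|{node_status}".encode()).hexdigest()

block_data = "tx batch #1"

# Validator 1 reads the API, then the API changes, then validator 2 reads it.
status_seen_by_v1 = read_external_status("node_A")
external_api_state["node_A"] = "Offline"     # status flipped between the two reads
status_seen_by_v2 = read_external_status("node_A")

state_v1 = apply_block("genesis", block_data, status_seen_by_v1)
state_v2 = apply_block("genesis", block_data, status_seen_by_v2)

print(state_v1 == state_v2)   # False -> the validators cannot agree on the resulting block
```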

Personally I like the EOS solution best. If minimum specs can go down to ~400 USD per month then the top 70 can still run at a profit even at the current market prices and even at the current commission rate. Even this is just a temporary solution, as I think at some point we’ll operate more similarly to Cosmos where we just have 100 (give or take) block producing nodes and no sub p-reps. With more block producing nodes the necessity for sub p-reps starts to disappear because the network can continue to operate even if a decent chunk of nodes go down.

What about this part? Are the main P-Reps checking the sub P-Reps to decide on the status change? I am unfamiliar with this specific process, but if it is not a message-heavy process then the solution may lean towards it.

All in all, I believe that the ability to recognize offline nodes should be implemented as a core function of the system. I am a bit surprised to hear that there is not an easy solution for it.

I would say we all agree on this.

As for your question, I’m not exactly sure of the process of changing from main to sub, but all the data necessary to reach consensus on ranking (who is main/sub/candidate) is on-chain.

Thanks for the feedback. As you said, it is a good thing that we all agree on it.

It seems that for the technical implementation we will need someone from the technical team to find the best approach.

With the current development and priority list in mind, how do you recommend we continue with this?
Do you think it’s reasonable to submit the problem through an official network proposal for a governance vote? If it gets accepted, that would put it on the dev team’s dashboard.

At the moment of writing, 14 out of 78 active Sub P-Reps have nodes that are not working or are out of sync.

If there are no further objections, we will proceed and submit a network text proposal asking the technical team to come up with a solution that prevents offline nodes from receiving network rewards without satisfying the minimum requirement of running a node.

Sorry for the delayed response. The tech team is already working on the architecture of a solution, with plans to either increase the number of main p-reps, lower the number of sub p-reps, and/or have a rotating main p-rep slot that randomly picks a sub p-rep to produce a block.

None of these are short-term solutions, and all of them require more R&D. My recommendation is to hold off on this subject for a couple of months until we are closer to being able to do something about it. The ICON team is 100% aware of the situation and will have a proposed fix for the problem when we are closer to having the resources to fix it. At that point, it would make sense to re-open the discussion on the forum, gather opinions, and vote on the next steps. I’d say you don’t need to put together a vote just to get it on their dashboard. I can definitely say it’s already there.


Thank you for the reply.

I understand the situation and will follow the recommendation to delay the proposal and discussion until the technical team comes back with possible solutions.
