Enhance reliability of loadbalancer #17

Open
flannoo opened this issue Oct 8, 2024 · 3 comments

Comments

@flannoo

flannoo commented Oct 8, 2024

Would it be possible to add an additional check in the loadbalancer, so that it determines whether a node is responding in a timely fashion and stops forwarding requests to unresponsive nodes? Sometimes nodes are part of the cluster but, due to resource exhaustion (CPU usage) or other reasons, they don't respond in a timely manner.

This would make the loadbalancer more reliable. I've seen this impact transactions on Stargazer, which depends on the loadbalancer: transactions don't go through when they are forwarded to unresponsive nodes. It would also help reliability in the nodectl tool, since it uses the loadbalancer to query network state.

These are just my observations, so correct me if my assumptions are wrong.
It could also be a sticky-session issue, if a node happens to become unresponsive during the TTL of a sticky session.

Thanks!

@AlexBrandes
Member

Hey @flannoo - There are a few different possible reasons for poor performance from the load balancer. Some of them are fairly difficult to address from the load balancer itself, others could probably be handled better.

I've seen these issues:

  • L1 node is forked. This can happen and causes issues with sending transactions. The transaction appears to be accepted but is then dropped when passed to L0 consensus. We don't have a great way to detect this scenario externally on L1. It's easy to detect on L0 by examining the snapshot chain, but there's no equivalent on L1. This likely needs to be addressed at the protocol level.
  • Load balancer is down. We have monitors to bring the LBs back up if they go down, but they can be down for a few minutes at a time, or longer if there's some significant issue. That's the danger of using centralized components like the LBs.
  • Stale cluster info. After a network restart or during other network changes, the load balancer will sometimes keep an offline node in memory and still direct traffic to it. It's a bit complicated, though, because these nodes are often still up but now on an invalid fork of the network, so they're still responsive but will not successfully process transactions.

I haven't personally seen a timeout situation with the load balancer that would be due to the resource exhaustion issue. We're aware of the resource exhaustion issue on validator nodes and have a fix being evaluated on TestNet now. Usually the nodes will be stuck in DIP or some other non-active state and so they shouldn't be picked up in the load balancer in the first place. Maybe they're being picked up when active and then stay on as stale nodes? Or else sometimes those nodes will respond sporadically which makes detection somewhat difficult.

I don't mean to list all these issues as a way to say there's nothing that can be done. I just wanted to illustrate some of the known issues and their potential solutions in case what you're seeing may be more related to one of those cases.

What do you think a potential fix would be for the issue you described here? Should we keep a record of response times from each node and remove nodes that have an average response time above some threshold?
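The response-time idea in the question above could look roughly like the following. This is a minimal sketch, not the load balancer's actual code: the `LatencyTracker` class, the window size, and the 2-second threshold are all hypothetical, and a timeout is simply recorded as a sample equal to the timeout value.

```python
from collections import defaultdict, deque
from statistics import mean
from typing import Deque, Dict, Iterable, List, Optional


class LatencyTracker:
    """Rolling window of response times per node (hypothetical helper)."""

    def __init__(self, window: int = 20, threshold_s: float = 2.0) -> None:
        self.threshold_s = threshold_s
        # Each node keeps only its most recent `window` samples.
        self._samples: Dict[str, Deque[float]] = defaultdict(
            lambda: deque(maxlen=window)
        )

    def record(self, node_id: str, elapsed_s: float) -> None:
        """Store one observed response time, in seconds."""
        self._samples[node_id].append(elapsed_s)

    def average(self, node_id: str) -> Optional[float]:
        """Mean of the node's recent samples, or None if it has none."""
        samples = self._samples.get(node_id)
        return mean(samples) if samples else None

    def healthy_nodes(self, node_ids: Iterable[str]) -> List[str]:
        """Keep nodes whose rolling average is at or below the threshold.

        Nodes without samples are kept, so new nodes are not starved.
        """
        healthy = []
        for node_id in node_ids:
            avg = self.average(node_id)
            if avg is None or avg <= self.threshold_s:
                healthy.append(node_id)
        return healthy
```

Because the window is bounded, a node that recovers drops back under the threshold after enough fast responses. A node that responds only sporadically would still be hard to classify with an average alone.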

@flannoo
Author

flannoo commented Oct 12, 2024

Hi Alex,

Thank you for looking into this and explaining the different causes.

I looked a bit further into the code base. Are the criteria/assumptions below correct?

Active host criteria
A node is considered active when these conditions are true:

  • the /cluster/info endpoint returns a valid info list within 5 seconds
  • at least half of the nodes in the whole cluster report the node to be in the "Ready" state

This check happens every 60 seconds.
Traffic is only routed to active hosts.
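As a rough illustration of the two criteria above (a sketch under assumptions, not the actual implementation): `reports` is a hypothetical mapping from each probed node to the peer states it returned from /cluster/info, or `None` if the probe failed or timed out. The sketch also assumes that a node's own view includes itself and that "the whole cluster" means every probed node.

```python
from typing import Dict, List, Optional

# Hypothetical shape: node id -> {peer id: state}, or None on probe failure.
Reports = Dict[str, Optional[Dict[str, str]]]


def active_hosts(reports: Reports, quorum: float = 0.5) -> List[str]:
    """Return nodes that answered the probe and that at least `quorum`
    of all probed nodes report as "Ready"."""
    total = len(reports)
    active = []
    for node_id, own_view in reports.items():
        if own_view is None:
            # No valid /cluster/info response within the timeout.
            continue
        ready_votes = sum(
            1
            for view in reports.values()
            if view is not None and view.get(node_id) == "Ready"
        )
        if ready_votes >= quorum * total:
            active.append(node_id)
    return active
```

The `quorum` parameter is there only to make the threshold explicit; 0.5 corresponds to "at least half".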

Inactive host handling
When a node changes state (e.g. to DownloadInProgress), it's still included in the /cluster/info list, but traffic is no longer forwarded to it (so it is marked as an inactiveHost). When a node is non-responsive (e.g. it doesn't reply within 5 seconds), it gets evicted from the LB completely.

In both scenarios, any sticky session pinned to a node that has turned inactive is invalidated, and its traffic is routed to an active node instead.
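The sticky-session behavior described here might be sketched as follows. The `StickyRouter` class is hypothetical, session TTL handling is left out for brevity, and picking the replacement node at random is an assumption.

```python
import random
from typing import Dict, List, Optional


class StickyRouter:
    """Pins each session to a node and re-pins it when that node
    is no longer in the active set (hypothetical helper)."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._sessions: Dict[str, str] = {}
        self._rng = rng or random.Random()

    def route(self, session_id: str, active: List[str]) -> str:
        if not active:
            raise RuntimeError("no active hosts available")
        pinned = self._sessions.get(session_id)
        if pinned not in active:
            # New session, or the pinned node turned inactive or was
            # evicted: invalidate the pin and choose an active node.
            pinned = self._rng.choice(active)
            self._sessions[session_id] = pinned
        return pinned
```

Checking the pin on every request means a session never outlives its node's active status, regardless of how much TTL it has left.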

If the above is correct, then that seems solid. I'm not sure how long it takes for the majority of the cluster to become aware of peer nodes changing state, so perhaps it's an option to make the "proof of readiness" a bit stricter and change it to 80% or more? (I'm referring to the criterion "at least half of the nodes in the whole cluster report the node to be in the "Ready" state".)

Another thing to consider is how this LB will scale. The current HTTP client allows 128 simultaneous connections, so the health probe queues HTTP requests once the cluster has more than 128 nodes. If most nodes respond quickly, this shouldn't cause significant delays. But when many of the probed nodes take longer than 5 seconds to respond, evicting inactive nodes could be delayed. I don't think it's a big issue currently: with about 250 nodes, the probes run in two batches, so the maximum delay if every node took longer than 5 seconds would be 10 seconds (which seems OK). It might be something to look out for in the future when the network opens up to everyone (at which point a distributed loadbalancer might start to make sense).
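The arithmetic behind that estimate can be written down as a small function. This is a simplified worst-case model: it assumes every probe runs into the full timeout and that the connection pool is the only limit. The function name is hypothetical; the defaults of 128 connections and 5 seconds are the figures quoted above.

```python
import math


def worst_case_probe_delay(
    nodes: int, max_connections: int = 128, timeout_s: float = 5.0
) -> float:
    """Seconds until the last probe times out when all nodes are slow."""
    if nodes <= 0:
        return 0.0
    # Probes beyond the connection limit wait for a free slot, so a
    # fully slow cluster is worked through in back-to-back batches.
    batches = math.ceil(nodes / max_connections)
    return batches * timeout_s


print(worst_case_probe_delay(250))   # 10.0 (two batches)
print(worst_case_probe_delay(1000))  # 40.0 (eight batches)
```

At 1000 nodes the worst case is still within the 60-second check interval, but it grows linearly with cluster size.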

I initially thought there wasn't a real-time liveness probe on all nodes. I assumed the LB just took the cluster info from one source node and forwarded requests to the IP addresses that this source node deemed "Ready". Digging deeper, it seems it does actually probe every node every 60 seconds (if the above is correct), which was my main concern.

Thanks!

@AlexBrandes
Member

@flannoo - sorry for the long delay. I must have missed the notification of your reply.

> If the above is correct, then that seems solid. I'm not sure how long it takes for the majority of the cluster to become aware of peer nodes turning into a different state, so perhaps it's an option to make the "proof of readiness" a bit more stricter and change it to 80% or more? (talking about this criteria point "when at least half of the nodes in the cluster report the node to be in "Ready" state").

Yeah, it sounds like you gained a pretty good understanding of how it works. Increasing the "proof of readiness" majority percentage is an interesting idea but I'm not sure that the latency of that state transition is a major source of issues.

Many of the problems really come back to the issue of forking on the network, and since our network is multi-layered there are multiple places that can happen. L1 forking is the most difficult to detect and, I suspect, where most of the issues with sending transactions via the load balancer come from. The problem is that an L1 node can fork and then still maintain a view of other nodes on a different fork for a period of time before they get removed for not participating in consensus. So we end up with the load balancer having multiple views of the network across multiple forks with no way to understand that it's viewing separate forks or which fork would be the correct one to follow, even if it was clear there was a fork happening.

So I think "Ready" state doesn't really tell the full picture of whether a node can successfully handle a transaction. It must be Ready and on the majority fork of the network.
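For L0, where forks can be detected from the snapshot chain, the "Ready and on the majority fork" condition could be sketched like this. It is an illustration under assumptions, not an existing check: `snapshot_hashes` is a hypothetical mapping from node id to the snapshot hash that node reports at one common ordinal, and a strict majority of Ready nodes is assumed to identify the correct fork. No equivalent signal is available for L1.

```python
from collections import Counter
from typing import Dict, List, Optional


def routable_nodes(
    states: Dict[str, str],
    snapshot_hashes: Dict[str, Optional[str]],
) -> List[str]:
    """Ready nodes whose reported hash matches the strict majority.

    Returns an empty list when no hash has a strict majority, since
    the correct fork cannot be determined in that case.
    """
    ready = [
        node
        for node, state in states.items()
        if state == "Ready" and snapshot_hashes.get(node) is not None
    ]
    if not ready:
        return []
    counts = Counter(snapshot_hashes[node] for node in ready)
    top_hash, top_count = counts.most_common(1)[0]
    if top_count * 2 <= len(ready):
        return []
    return [node for node in ready if snapshot_hashes[node] == top_hash]
```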

> Another thing to consider, is how this LB will scale...

I'm not too concerned about the scalability. We haven't gotten close to any of those limits yet, and there are some simple changes we could make if we did hit them. For example, we could have the LB track a maximum of N nodes at a time and randomly choose nodes up to that limit. There are also fairly easy ways to increase the number of connections the service can handle. Or we could split the LB into two or more LBs that each cover a portion/shard of the network, and so on.
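The first option, tracking at most N randomly chosen nodes, is simple to sketch. The function name and parameters are hypothetical.

```python
import random
from typing import List, Optional


def pick_tracked_nodes(
    nodes: List[str],
    max_tracked: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return every node when under the cap, otherwise a uniform
    random sample of `max_tracked` distinct nodes."""
    rng = rng or random.Random()
    if len(nodes) <= max_tracked:
        return list(nodes)
    return rng.sample(nodes, max_tracked)
```

With a cap of 128, for example, each probe round would fit in a single batch of the current connection limit mentioned earlier in the thread.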

We're going to be doing some upgrades to the LB codebase in the near future (fixing cross-env indexing, for example), so hopefully that will help with some of it. We've also been testing changes that let nodes self-correct if they detect that they're on a minority fork, which I think gets us closer to the real solution for most of these issues.
