How do nodes work in a cluster? #4314

Gutaicheng · 2023-07-17T00:49:09Z

I know that with a single node in publish-subscribe mode, nats-server has a buffer limit (max_pending) and a write timeout (write_deadline) when sending messages to subscribers. If either of these is reached, the server identifies the subscriber as a slow consumer and disconnects it.

Now suppose there is a cluster: the publisher publishes a message to node A and the subscriber subscribes on node B, so node A forwards the message to node B. I would like to know whether there is a similar buffer limit or write timeout for forwarding between nodes. I didn't find this in the documentation (sorry if it is there; I hope you can point me to it).

If so, how should I configure it? If not, are there other measures for dealing with insufficient bandwidth or disconnections between nodes?

I also found that max_payload does not limit messages arriving from another node. For example, if node A sets max_payload to 2MB and node B sets max_payload to 1MB, then when node A forwards a 2MB message to node B, node B and its subscribers receive the 2MB message. Is this normal?

Another problem is an ambiguity I found in the docs when reading about maxReconnects.
First, from the docs I understood that if there are 5 servers in the list and maxReconnects is set to 10, then each server will be tried 10 times, a server will be removed after 10 failed reconnect attempts, and the total is 50 attempts.
Then I saw in the Javadoc: "if the server list contains 5 servers but max reconnects is set to 3, only 3 of those servers will be tried." I tested this in code and found that the latter is indeed what happens, so what does the former statement in the docs mean?

Answered by derekcollison

derekcollison · 2023-07-17T18:46:20Z

Maintainer

Slow consumer detection between servers is limited to time, rather than time and size as with clients. So it is possible to get a slow consumer on a route or gateway, but only when the system call to pwrite() takes more than 10s (you can configure this).

If the system sends a message, it is not bound by the same max payload rules, by design.

For the reconnect, we strive to have all clients behave similarly, but I will loop in @scottf for details there.
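
To make the distinction above concrete, here is a minimal Python sketch of the rule as described in this answer. It is a toy model, not nats-server code: the names `Limits` and `is_slow_consumer`, the parameters, and the default values are assumptions chosen for illustration. Client connections are checked against both a pending-bytes limit and a write deadline, while route and gateway connections are checked against the write deadline only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # Illustrative values only; the real settings live in the server config.
    max_pending_bytes: int = 64 * 1024 * 1024
    write_deadline_s: float = 10.0


def is_slow_consumer(kind: str, pending_bytes: int, write_blocked_s: float,
                     limits: Limits = Limits()) -> bool:
    """Toy model of the rule described in the thread (hypothetical helper).

    kind is "client", "route", or "gateway".
    """
    timed_out = write_blocked_s > limits.write_deadline_s
    if kind == "client":
        # Clients: too many buffered bytes OR a write that blocked too long.
        return timed_out or pending_bytes > limits.max_pending_bytes
    # Routes and gateways: time only; the buffered size is not checked.
    return timed_out
```

In this model, a large backlog alone never marks a route as a slow consumer; only a write that stays blocked past the deadline does, which is why the follow-up question about memory use comes up below.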

10 replies

wallyqs Jul 18, 2023
Maintainer

Yes, that is correct.

Gutaicheng Jul 19, 2023
Author

Since the buffer size cannot be set, how does the node handle the situation where memory fills up because messages back up within the write_deadline window?

derekcollison Jul 19, 2023
Maintainer

It closes the connection and will retry.

Gutaicheng Jul 20, 2023
Author

@derekcollison thanks! How do I set the memory size of the NATS server? I didn't see any performance settings in the configuration documentation, only in Helm.

derekcollison Jul 20, 2023
Maintainer

There is not necessarily a way to set the total memory used by the server, but in environments that can limit it with cgroups etc., you can also set the GOMEMLIMIT env variable (Go's soft memory limit) to inform the GC of the limit. I believe we do this in the Helm charts for k8s etc.

scottf · 2023-07-17T21:45:26Z

Collaborator

@Gutaicheng The documentation needs to be updated. Max reconnects is the number of times to retry connecting to each known server. So if there are 5 servers and you set max reconnects to 3:

If it fails to connect to any server, it will retry that server 3 more times.
If a server is connected to, the retry count for that server is reset.
When connecting, the server presents the list of servers it knows about. These are folded into a full list, and servers no longer known by the server are removed from that list, unless they were in the original bootstrap.

I think an example is an easier way to demonstrate this:

--- bootstrap is nats://host:4222, nats://host:5222, nats://host:6222, nats://host:7222, nats://host:8222
--- connection is set to randomize (default) and max retries of 3
--- 4222 is the only server available.

trying nats://host:7222 - tried, marked as failed once & retried 0 times
trying nats://host:8222 - tried, marked as failed once & retried 0 times
trying nats://host:4222 - tried, retries cleared to 0

--- after being connected to 4222, server 4222 was brought down, there are no servers available.

trying nats://host:6222 - tried, marked as failed once & retried 0 times
trying nats://host:5222 - tried, marked as failed once & retried 0 times
trying nats://host:7222 - tried, marked as failed twice & retried 1 time
trying nats://host:8222 - tried, marked as failed twice & retried 1 time
trying nats://host:4222 - tried, marked as failed once & retried 0 times

trying nats://host:6222 - tried, marked as failed twice & retried 1 time
trying nats://host:5222 - tried, marked as failed twice & retried 1 time
trying nats://host:7222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:8222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:4222 - tried, marked as failed twice & retried 1 time

trying nats://host:6222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:5222 - tried, marked as failed 3 times & retried 2 times
trying nats://host:7222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:8222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:4222 - tried, marked as failed 3 times & retried 2 times

trying nats://host:6222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:5222 - tried, marked as failed 4 times & retried 3 times
trying nats://host:4222 - tried, marked as failed 4 times & retried 3 times

done retrying
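
The accounting in the trace above can be reproduced with a small simulation. This is a sketch under assumptions, not the client's actual implementation: `simulate_reconnect` and its parameters are hypothetical names, servers are tried in a fixed order instead of being randomized, and a server is dropped once it has been retried `max_retries` times after its first failure.

```python
def simulate_reconnect(servers, available, max_retries, failures=None):
    """Try servers round by round until one connects or all are exhausted.

    servers: server URLs in the order they are tried (no randomization here).
    available: set of URLs that currently accept connections.
    failures: consecutive failures per server so far (mutated in place).
    Returns (connected_url_or_None, log_lines).
    """
    failures = {} if failures is None else failures
    log = []
    while True:
        # A server stays eligible until it has failed 1 + max_retries times.
        candidates = [s for s in servers if failures.get(s, 0) <= max_retries]
        if not candidates:
            log.append("done retrying")
            return None, log
        for url in candidates:
            if url in available:
                failures[url] = 0  # a successful connect resets the count
                log.append(f"trying {url} - connected, retries cleared to 0")
                return url, log
            failures[url] = failures.get(url, 0) + 1
            log.append(f"trying {url} - failed {failures[url]} time(s), "
                       f"retried {failures[url] - 1} time(s)")


# Same scenario as the trace: five bootstrap servers, max retries of 3.
hosts = [f"nats://host:{p}" for p in (7222, 8222, 4222, 6222, 5222)]
failures = {}

# Phase 1: only 4222 is up, so 7222 and 8222 each fail once first.
first, _ = simulate_reconnect(hosts, {"nats://host:4222"}, 3, failures)

# Phase 2: 4222 goes down and no server is available.
second, log = simulate_reconnect(hosts, set(), 3, failures)
print("\n".join(log))
```

With the fixed order used here, the second phase makes 18 failed attempts before giving up, the same count as in the trace: 7222 and 8222 get three more tries each because they had already failed once, and the other three servers get four tries each.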

1 reply

Gutaicheng Jul 18, 2023
Author

Thank you very much. I think I now understand how reconnection works.
