Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARSN-381 RPC command system between cluster workers #2202

Merged
merged 2 commits into from
Jan 12, 2024

Conversation

jonathan-gramain
Copy link
Contributor

When using the cluster module, new processes are forked and are dispatched workloads, usually HTTP requests. The ClusterRPC module implements a RPC system to send commands to all cluster worker processes at once from any particular worker, and retrieve their individual command results, like a distributed map operation.

The existing cluster IPC channel is setup from the primary to each worker, but not between workers, so there has to be a hop by the primary.

How a command is treated:

  • a worker sends a command message to the primary

  • the primary then forwards that command to each existing worker (including the requestor)

  • each worker then executes the command and returns a result or an error

  • the primary gathers all workers results into an array

  • finally, the primary dispatches the results array to the original requesting worker callback

The original use of this feature is in Metadata DBD (bucketd) to implement a global cache refresh across worker processes.

@bert-e
Copy link
Contributor

bert-e commented Dec 29, 2023

Hello jonathan-gramain,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Status report is not available.

@bert-e
Copy link
Contributor

bert-e commented Dec 29, 2023

Request integration branches

Waiting for integration branch creation to be requested by the user.

To request integration branches, please comment on this pull request with the following command:

/create_integration_branches

Alternatively, the /approve and /create_pull_requests commands will automatically
create the integration branches.

@jonathan-gramain jonathan-gramain force-pushed the improvement/ARSN-381-cluster-rpc-helpers branch from 97150d8 to e5fa1b6 Compare December 29, 2023 22:17
uids,
payload,
};
return process.send?.(message);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If process.send is undefined no worker will receive the call, so it might be worth failing early at init if the IPC channel is not available.

* - each worker then executes the command and returns a result or an
* error
*
* - the primary gathers all workers results into an array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a way to set a deadline or timeout. If one worker is stuck or dies for any reason, the command will stay in-flight and never return - leaving the requester to either hang or trigger its own timeout, and uidsToWorkerId would keep growing over time.

This might be acceptable, in which case a mention of that behavior in the jsdoc would be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good idea indeed to add a worker timeout for the reason you mention (maybe configurable via an extra argument to sendWorkerCommand).


// exported types

export type ResultObject = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember if Promises are available within the target environment (I'd think so), but they would help with a nicer API even if async/await is not available: getting rid of nested callbacks and making the ResultObject wrapper type superfluous (even possibly the callback types but that may be going too far).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we could use promises. Actually Metadata can use promises and async/await even though most of the code is legacy, there are small pieces of code that do use them.

It's not clear to me why ResultObject would not be needed anymore though, but I can look at it I may figure it out :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think I understand, replacing the "error" field by a rejected promise, and the "results" by a resolved one

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly!

* - the primary then forwards that command to each existing worker
* (including the requestor)
*
* - each worker then executes the command and returns a result or an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kind of obvious but a mention might be warranted of the type of messages that are allowed in/out - since they have to be serialized, any circular references or very big buffers would cause issues.

return undefined;
}
// send back response to original worker
_dispatchCommandResultsToWorker(toWorker, uids, completeCommandResultsArray);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be a good idea to include a result slot for any new workers spawned after the command was first dispatched, that didn't get to execute the command.

I understand that it's not required for the initial purpose of just a cache flush, but since it's a more "generic" module, future users of this code might be interested to know that some processes that didn't get to run the command are currently servicing requests. Or, just a mention in the jsdoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can mention this fact in the docs. To check this I think we would have to check the cluster.workers array again before responding, or maybe by watching events when a new worker is spawned. Since it's not needed for now I think a comment will be fine.

@jonathan-gramain jonathan-gramain force-pushed the improvement/ARSN-381-cluster-rpc-helpers branch from b713f0f to 7cd0518 Compare January 11, 2024 23:51
@jonathan-gramain jonathan-gramain force-pushed the improvement/ARSN-381-cluster-rpc-helpers branch from 7cd0518 to 5c2bfce Compare January 12, 2024 00:16
When using the cluster module, new processes are forked and are
dispatched workloads, usually HTTP requests. The ClusterRPC module
implements a RPC system to send commands to all cluster worker
processes at once from any particular worker, and retrieve their
individual command results, like a distributed map operation.

The existing cluster IPC channel is setup from the primary to each
worker, but not between workers, so there has to be a hop by the
primary.

How a command is treated:

- a worker sends a command message to the primary

- the primary then forwards that command to each existing worker
  (including the requestor)

- each worker then executes the command and returns a result or an
  error

- the primary gathers all workers results into an array

- finally, the primary dispatches the results array to the original
  requesting worker callback

The original use of this feature is in Metadata DBD (bucketd) to
implement a global cache refresh across worker processes.
@jonathan-gramain jonathan-gramain force-pushed the improvement/ARSN-381-cluster-rpc-helpers branch from 5c2bfce to 3b9c93b Compare January 12, 2024 00:26
@jonathan-gramain
Copy link
Contributor Author

/approve

@bert-e
Copy link
Contributor

bert-e commented Jan 12, 2024

Conflict

A conflict has been raised during the creation of
integration branch w/8.1/improvement/ARSN-381-cluster-rpc-helpers with contents from improvement/ARSN-381-cluster-rpc-helpers
and development/8.1.

I have not created the integration branch.

Here are the steps to resolve this conflict:

 $ git fetch
 $ git checkout -B w/8.1/improvement/ARSN-381-cluster-rpc-helpers origin/development/8.1
 $ git merge origin/improvement/ARSN-381-cluster-rpc-helpers
 $ # <intense conflict resolution>
 $ git commit
 $ git push -u origin w/8.1/improvement/ARSN-381-cluster-rpc-helpers

The following options are set: approve

@bert-e
Copy link
Contributor

bert-e commented Jan 12, 2024

In the queue

The changeset has received all authorizations and has been added to the
relevant queue(s). The queue(s) will be merged in the target development
branch(es) as soon as builds have passed.

The changeset will be merged in:

  • ✔️ development/7.70

  • ✔️ development/8.1

The following branches will NOT be impacted:

  • development/6.4
  • development/7.10
  • development/7.4

There is no action required on your side. You will be notified here once
the changeset has been merged. In the unlikely event that the changeset
fails permanently on the queue, a member of the admin team will
contact you to help resolve the matter.

IMPORTANT

Please do not attempt to modify this pull request.

  • Any commit you add on the source branch will trigger a new cycle after the
    current queue is merged.
  • Any commit you add on one of the integration branches will be lost.

If you need this pull request to be removed from the queue, please contact a
member of the admin team now.

The following options are set: approve

@bert-e
Copy link
Contributor

bert-e commented Jan 12, 2024

I have successfully merged the changeset of this pull request
into targetted development branches:

  • ✔️ development/7.70

  • ✔️ development/8.1

The following branches have NOT changed:

  • development/6.4
  • development/7.10
  • development/7.4

Please check the status of the associated issue ARSN-381.

Goodbye jonathan-gramain.

@bert-e bert-e merged commit 3b9c93b into development/7.70 Jan 12, 2024
7 checks passed
@bert-e bert-e deleted the improvement/ARSN-381-cluster-rpc-helpers branch January 12, 2024 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants