Undeploy models with no WorkerNodes #3380

Open
brianf-aws wants to merge 8 commits into base: main

brianf-aws
Contributor

@brianf-aws brianf-aws commented Jan 11, 2025

This PR marks model IDs that have no nodes associated with them as undeployed, so that an undeploy request does what its name promises.

Description

When undeploy is performed and the model has no nodes associated with it, the model's entry in the index is reset to the UNDEPLOYED status.

Here is an example of why this code change is needed.

This scenario reproduces the PARTIALLY_DEPLOYED issue.

  1. Have nodes a, b, c, d in the cluster associated with modelID:@, i.e. perform deploy on it.
  2. Bring a and b down while the sync-up job is running.
  3. Sync-up now marks the model PARTIALLY_DEPLOYED.
  4. Stop sync-up.
  5. Bring the other two nodes, c and d, down, and bring two nodes up (these now have new IDs 1, 2).
  6. Bring the other two nodes back, which also have different IDs, so the cluster now has (1, 2, 3, 4).
    But the model index still says PARTIALLY_DEPLOYED and no nodes are serving the model.

With this fix, if no nodes are serving the model, the index is set to UNDEPLOYED, whether or not it is already UNDEPLOYED.
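The rule described above can be sketched in isolation. This is a minimal illustration, not the PR's code: the class `StaleModelStateSketch`, the method `resetIfNoNodes`, and the use of a plain `Map` in place of the model index are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleModelStateSketch {
    static final String UNDEPLOYED = "UNDEPLOYED";

    /**
     * Hypothetical helper: modelIndex stands in for the model index (model id -> state).
     * If the undeploy call reached no node, every requested model is forced to
     * UNDEPLOYED, whatever state the index held before. Returns true when the
     * index was rewritten.
     */
    static boolean resetIfNoNodes(List<String> respondingNodes, String[] modelIds, Map<String, String> modelIndex) {
        if (!respondingNodes.isEmpty()) {
            return false; // some node handled the undeploy, so the regular path applies
        }
        for (String modelId : modelIds) {
            modelIndex.put(modelId, UNDEPLOYED);
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> index = new HashMap<>();
        // Stale state left behind after nodes a..d were replaced by nodes 1..4.
        index.put("model-1", "PARTIALLY_DEPLOYED");
        boolean rewritten = resetIfNoNodes(List.of(), new String[] { "model-1" }, index);
        System.out.println(rewritten + " " + index.get("model-1")); // prints "true UNDEPLOYED"
    }
}
```

The check is deliberately unconditional on the previous state, matching the "whether or not it is already UNDEPLOYED" wording in the description.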

Related Issues

Resolves #3285

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

This commit aims to undeploy modelIds that have no nodes associated to them so as to keep the intention of undeploy truthful.

Signed-off-by: Brian Flores <[email protected]>
bulkUpdateRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
log.info("No models service: {}", modelIds.toString());
client.bulk(bulkUpdateRequest, ActionListener.wrap(br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, e -> {
log.error("Failed to set modelIds to UNDEPLOY in index", e);
Collaborator

Should we send this exception back to the client side? If yes, we should pass the listener to this method and add this line here:

listener.onFailure(e);

Collaborator

+1

Contributor Author

@brianf-aws brianf-aws Jan 11, 2025

I'm not sure the user is concerned with this failure. The only case where I see it being a problem is when the model index does not exist and the user runs undeploy, which might cause a write failure.

Added your suggestion to report the failure if it can't write back to the index.
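A rough sketch of the behavior agreed on here, under stated assumptions: `Listener` is a minimal stand-in for a two-callback listener (it is not the OpenSearch `ActionListener`), and `bulkWrite`, `resetIndexThenRespond`, and `observe` are hypothetical names. The caller is answered only after the index write finishes, and a write failure is forwarded instead of only being logged.

```java
import java.util.ArrayList;
import java.util.List;

public class BulkFailurePropagationSketch {
    /** Minimal two-callback listener; a stand-in, not the OpenSearch ActionListener. */
    interface Listener<T> {
        void onResponse(T response);
        void onFailure(Exception e);
    }

    /** Hypothetical bulk write that succeeds or fails depending on the flag. */
    static void bulkWrite(boolean indexWritable, Listener<String> bulkListener) {
        if (indexWritable) {
            bulkListener.onResponse("bulk ok");
        } else {
            bulkListener.onFailure(new IllegalStateException("model index not writable"));
        }
    }

    /** Answers the caller only after the index write has finished, one way or the other. */
    static void resetIndexThenRespond(boolean indexWritable, String undeployResponse, Listener<String> caller) {
        bulkWrite(indexWritable, new Listener<String>() {
            @Override
            public void onResponse(String bulkResponse) {
                caller.onResponse(undeployResponse); // index is updated before the client hears back
            }

            @Override
            public void onFailure(Exception e) {
                caller.onFailure(e); // surface the write failure instead of only logging it
            }
        });
    }

    /** Runs one call and records what the caller observed. */
    static List<String> observe(boolean indexWritable) {
        List<String> events = new ArrayList<>();
        resetIndexThenRespond(indexWritable, "{}", new Listener<String>() {
            @Override
            public void onResponse(String response) {
                events.add("response " + response);
            }

            @Override
            public void onFailure(Exception e) {
                events.add("failure " + e.getMessage());
            }
        });
        return events;
    }

    public static void main(String[] args) {
        System.out.println(observe(true));  // prints "[response {}]"
        System.out.println(observe(false)); // prints "[failure model index not writable]"
    }
}
```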

@@ -157,10 +163,36 @@ private void undeployModels(String[] targetNodeIds, String[] modelIds, ActionLis
MLUndeployModelNodesRequest mlUndeployModelNodesRequest = new MLUndeployModelNodesRequest(targetNodeIds, modelIds);

client.execute(MLUndeployModelAction.INSTANCE, mlUndeployModelNodesRequest, ActionListener.wrap(r -> {
if (r.getNodes().isEmpty()) {
Collaborator

It makes sense that when undeploy model is executed and the response returns no worker nodes, the model is set to undeployed.

But this doesn't fix the partially undeployed issue, right?

Contributor Author

The PARTIALLY_DEPLOYED issue is a mixture of different scenarios. One of them goes like this:

  1. Have nodes a, b, c, d in the cluster associated with modelID:@, i.e. perform deploy on it.
  2. Bring a and b down while the sync-up job is running.
  3. Sync-up now marks the model PARTIALLY_DEPLOYED.
  4. Stop sync-up.
  5. Bring the other two nodes, c and d, down, and bring two nodes up (these now have new IDs 1, 2).
  6. Bring the other two nodes back, which also have different IDs, so the cluster now has (1, 2, 3, 4).
    But the model index still says PARTIALLY_DEPLOYED and no nodes are serving the model.

With this fix, if no nodes are serving the model, the index is set to UNDEPLOYED, whether or not it is already UNDEPLOYED.

@dhrubo-os
Collaborator

Can we add unit test?

}
bulkUpdateRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
log.info("No models service: {}", modelIds.toString());
client.bulk(bulkUpdateRequest, ActionListener.wrap(br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, e -> {
Collaborator

I think you can return an MLUndeployModelsResponse that includes the original nodes rather than an empty one. The tests have some examples of how to create a new MLUndeployModelsResponse.

Collaborator

If we don't return the response inside br -> { log.debug("Successfully set modelIds to UNDEPLOY in index"); }, it's possible that when the client side receives the undeploy response, the model is still in the DEPLOYED state.

Contributor Author

I'm not sure how I feel about creating a new one based on the failures; I think it would be misleading. I will pass the original response r instead.

Collaborator

The failures should just return a message. But for the success case we should return something rather than {}

Collaborator

@dhrubo-os dhrubo-os Jan 11, 2025

If the success case shows the following (to stay consistent with the current output for the partially deployed case):

{
"node id 1" : Not Found,
"node id 2" : Not Found
}

I don't think this will add much value from the customer's POV.

But maybe we can send a success message to the customer?

Collaborator

This will still give an empty response, which is not accurate. Since we cannot send nodes in the response in this case, let's send something to show that the model or models were undeployed successfully. Something like:

{
<model_id_1>: "UNDEPLOYED SUCCESSFULLY",
<model_id_2>: "UNDEPLOYED SUCCESSFULLY"
}

Contributor Author

We discussed internally and decided not to send this information back, because returning an updated response could break backward compatibility (bwc).

Also, if we write "UNDEPLOYED SUCCESSFULLY", it may sound like an undeployment was performed, when in reality only the index is updated and nothing changes on the nodes carrying the "model".

@brianf-aws brianf-aws requested a deployment to ml-commons-cicd-env-require-approval January 11, 2025 04:36 — with GitHub Actions Waiting
This method is now guaranteed to write to the index before sending back the MLUndeploy response. It also sends back an exception if the write fails.

Signed-off-by: Brian Flores <[email protected]>
@brianf-aws brianf-aws had a problem deploying to ml-commons-cicd-env-require-approval January 11, 2025 04:51 — with GitHub Actions Failure
Added UTs for two scenarios: 1. Check that the bulk operation occurred when no nodes are returned in the undeploy response. 2. Check that the bulk operation did not occur when there are nodes that found the model in their cache.

Signed-off-by: Brian Flores <[email protected]>
@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval January 12, 2025 04:30 — with GitHub Actions Inactive
@brianf-aws
Contributor Author

Can we add unit test?

Added 2 UTs for the code fix:

  1. Check that the bulk write occurred when undeploy returned {}. This is the sign of a stale model index, so it is set to UNDEPLOYED.
  2. Check that the bulk write did not occur when some nodes returned a response for the model, meaning undeploy occurred and changed the index.
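The two scenarios can be illustrated without a mocking library. This is only a sketch: `FakeClient`, `handleUndeployResponse`, and `bulkCallsFor` are hypothetical names, the fake client merely counts bulk writes, and the PR's actual tests are not reproduced here.

```java
import java.util.List;

public class UndeployBulkWriteTestSketch {
    /** Hypothetical fake client that only counts bulk writes. */
    static class FakeClient {
        int bulkCalls = 0;

        void bulk(String[] modelIds) {
            bulkCalls++;
        }
    }

    /** Logic under test: write to the index only when no node answered the undeploy. */
    static void handleUndeployResponse(List<String> respondingNodes, String[] modelIds, FakeClient client) {
        if (respondingNodes.isEmpty()) {
            client.bulk(modelIds);
        }
    }

    /** Runs one scenario and reports how many bulk writes it triggered. */
    static int bulkCallsFor(List<String> respondingNodes) {
        FakeClient client = new FakeClient();
        handleUndeployResponse(respondingNodes, new String[] { "model-1" }, client);
        return client.bulkCalls;
    }

    public static void main(String[] args) {
        // Scenario 1: undeploy returned no nodes, so exactly one bulk write is expected.
        System.out.println(bulkCallsFor(List.of()));         // prints "1"
        // Scenario 2: a node answered, so no bulk write is expected.
        System.out.println(bulkCallsFor(List.of("node-1"))); // prints "0"
    }
}
```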

client.bulk(bulkRequest, ActionListener.runAfter(actionListener, () -> {
syncUpUndeployedModels(syncUpRequest);
listener.onResponse(undeployModelNodesResponse);
}));

@brianf-aws brianf-aws had a problem deploying to ml-commons-cicd-env-require-approval January 12, 2025 07:39 — with GitHub Actions Failure
@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval January 13, 2025 19:55 — with GitHub Actions Inactive
@brianf-aws brianf-aws had a problem deploying to ml-commons-cicd-env-require-approval January 13, 2025 19:55 — with GitHub Actions Failure
@brianf-aws brianf-aws requested a deployment to ml-commons-cicd-env-require-approval January 13, 2025 20:57 — with GitHub Actions Waiting
@brianf-aws brianf-aws had a problem deploying to ml-commons-cicd-env-require-approval January 14, 2025 00:44 — with GitHub Actions Failure
Signed-off-by: Brian Flores <[email protected]>
@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval January 14, 2025 00:45 — with GitHub Actions Inactive
ActionListener<MLUndeployModelsResponse> listenerWithContextRestoration = ActionListener
.runBefore(listener, () -> threadContext.restore());
ActionListener<BulkResponse> bulkResponseListener = ActionListener.wrap(br -> {
log.debug("Successfully set modelIds to UNDEPLOY in index");
Collaborator

Add model ids to log?

}

bulkUpdateRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
log.info("No nodes service: {}", Arrays.toString(modelIds));
Collaborator

How about clarifying more?

Suggested change
- log.info("No nodes service: {}", Arrays.toString(modelIds));
+ log.info("No nodes running these models: {}", Arrays.toString(modelIds));

Contributor Author

Valid feedback, added in commit 77f6e5b

@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval January 14, 2025 01:09 — with GitHub Actions Inactive
@brianf-aws brianf-aws requested a deployment to ml-commons-cicd-env-require-approval January 14, 2025 03:30 — with GitHub Actions Waiting
Signed-off-by: Brian Flores <[email protected]>
@brianf-aws brianf-aws temporarily deployed to ml-commons-cicd-env-require-approval January 14, 2025 05:13 — with GitHub Actions Inactive
@brianf-aws brianf-aws deployed to ml-commons-cicd-env-require-approval January 14, 2025 18:21 — with GitHub Actions Active
Development

Successfully merging this pull request may close these issues.

[BUG] Model undeploying giving empty response
7 participants