Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bugfix] Multinode hang fix with PP #6399

Closed
wants to merge 2 commits into from

Conversation

andoorve
Copy link
Collaborator

We needed to merge #6235 to allow for correct TP/PP rank assignment within a node. However, this caused our driver worker assignment to be out of sync now.

For TP = 2/PP = 2 case we previously had the following correspondence between ranks and workers.

Ranks   [0,  1,  2,  3]
Workers [W1, W2, W3, W4] 

We would then iterate through and infer the rank based on the index of the worker in the list. Therefore we would have W1, W3 be assigned as driver workers, which is ok as they correspond to Ranks 0, 2.

Now the ranks might be switched, i.e we can have the following:

Ranks   [0,  1,  3,  2]
Workers [W1, W2, W3, W4] 

In this case, we cannot have W1, W3 still be drivers, as this would cause deadlock. Rank 2 would have a None execute_model_req which would cause the workers to stop executing. Instead, we have to use worker_ranks to assign the correct workers W1 and W4 to the driver list.

cc: @youkaichao @aurickq

Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Copy link

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only trigger fastcheck CI to run, which consists only a small and essential subset of tests to quickly catch errors with the flexibility to run extra individual tests on top (you can do this by unblocking test steps in the Buildkite run).

Full CI run is still required to merge this PR so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger full CI as well.

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

self.tp_driver_workers.append(self.workers[rank - 1])
else:
self.non_driver_workers.append(self.workers[rank - 1])
for idx, rank in enumerate(worker_ranks[1:]):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can iterate over worker_ranks? I think worker_ranks[0] == 0 is always true.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but then we need to do idx - 1. It's a little bit less clean. We can get rid of the first if condition as well if you like that style more, but I kept it for readability.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's get rid of the first if condition as well, and put a comment here, saying that worker_ranks[0] == 0 .

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@youkaichao
Copy link
Member

Thanks for the hard debugging!

Signed-off-by: Muralidhar Andoorveedu <[email protected]>
@youkaichao
Copy link
Member

cherry-picked these commits to #6280 so that it can be tested.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants