[Bugfix] Multinode hang fix with PP #6399
Conversation
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
self.tp_driver_workers.append(self.workers[rank - 1])
else:
self.non_driver_workers.append(self.workers[rank - 1])
for idx, rank in enumerate(worker_ranks[1:]):
We can iterate over `worker_ranks`? I think `worker_ranks[0] == 0` is always true.
Yes, but then we need to do `idx - 1`. It's a little bit less clean. We can get rid of the first `if` condition as well if you like that style more, but I kept it for readability.
Let's get rid of the first `if` condition as well, and put a comment here saying that `worker_ranks[0] == 0`.
Done
Thanks for the hard debugging!
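For readers following along, here is a minimal standalone sketch of the loop shape settled on in this thread: iterate over `worker_ranks[1:]` with a comment noting that `worker_ranks[0] == 0`, and no leading `if`. This is not the vLLM code itself; `classify_workers`, `tp_size`, and the assumption that `workers[i]` holds distributed rank `worker_ranks[i + 1]` are illustrative stand-ins.

```python
def classify_workers(workers, worker_ranks, tp_size):
    """Split the non-rank-0 workers into TP-group drivers and plain workers.

    Assumes worker_ranks[0] == 0 (the driver process itself, which is not
    in `workers`) and that workers[i] has distributed rank worker_ranks[i + 1].
    """
    tp_driver_workers = []
    non_driver_workers = []
    # No leading `if` needed: worker_ranks[0] is always 0 (the driver),
    # so slicing it off skips the driver process.
    for idx, rank in enumerate(worker_ranks[1:]):
        if rank % tp_size == 0:
            # The first rank of each TP group drives its pipeline stage.
            tp_driver_workers.append(workers[idx])
        else:
            non_driver_workers.append(workers[idx])
    return tp_driver_workers, non_driver_workers


# Example (consistent with the TP = 2 / PP = 2 case described below):
# classify_workers(["W2", "W3", "W4"], [0, 1, 3, 2], tp_size=2)
# -> (["W4"], ["W2", "W3"])
```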
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Cherry-picked these commits to #6280 so that it can be tested.
We needed to merge #6235 to allow for correct TP/PP rank assignment within a node. However, this put our driver worker assignment out of sync.
For the TP = 2 / PP = 2 case, we previously had the following correspondence between ranks and workers: W1 → Rank 0, W2 → Rank 1, W3 → Rank 2, W4 → Rank 3.
We would then iterate through the list and infer each worker's rank from its index. W1 and W3 would therefore be assigned as driver workers, which is fine since they correspond to Ranks 0 and 2.
Now the ranks might be switched, i.e. Rank 2 can end up on W4 rather than W3.
In this case, we cannot keep W1 and W3 as drivers, as this would cause a deadlock: Rank 2 would get a `None` `execute_model_req`, which would cause the workers to stop executing. Instead, we have to use `worker_ranks` to assign the correct workers, W1 and W4, to the driver list.

cc: @youkaichao @aurickq
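To make the failure mode concrete, here is a small toy sketch contrasting index-based driver selection with rank-based selection. The worker names and the exact rank permutation are illustrative assumptions; only "Rank 2 now lives on W4" is taken from the description above.

```python
TP = 2  # tensor-parallel size (TP = 2 / PP = 2 case)

workers = ["W1", "W2", "W3", "W4"]

# Before #6235: list position matched the distributed rank.
old_ranks = [0, 1, 2, 3]

# After #6235: ranks within a node can be permuted relative to list order.
# Hypothetical permutation consistent with the description (Rank 2 on W4):
new_ranks = [0, 1, 3, 2]

def drivers_by_index(workers):
    # Old logic: infer the rank from the position in the list.
    return [w for i, w in enumerate(workers) if i % TP == 0]

def drivers_by_rank(workers, worker_ranks):
    # Fixed logic: use each worker's actual rank.
    return [w for w, r in zip(workers, worker_ranks) if r % TP == 0]

print(drivers_by_index(workers))            # ['W1', 'W3']
print(drivers_by_rank(workers, old_ranks))  # ['W1', 'W3'] - same, since order matched ranks
print(drivers_by_rank(workers, new_ranks))  # ['W1', 'W4'] - the correct drivers now
# If W3 (now Rank 3) were still treated as the Rank-2 driver, the real Rank 2
# would see a None execute_model_req and stop executing -> the hang this PR fixes.
```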