
[TPU] Support collective communications in XLA devices #6813

Merged: 6 commits merged into main on Jul 27, 2024

Conversation

WoosukKwon
Collaborator

This PR adds support for collective communications on XLA devices (TPU). It is implemented simply by falling back to xm.all_reduce and xm.all_gather on TPU devices. The one difference is the gather operation in the logits processor, where gather is replaced by all-gather to meet XLA's SPMD restriction.
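
A minimal sketch of the fallback idea described above (the function name, the `use_xla` flag, and the `group` parameter are illustrative, not the exact vLLM code):

```python
from typing import Optional

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm


def all_reduce(input_: torch.Tensor,
               use_xla: bool,
               group: Optional[dist.ProcessGroup] = None) -> torch.Tensor:
    """Sum-reduce `input_` across all ranks."""
    if use_xla:
        # On TPU, fall back to torch_xla's collective, which runs inside
        # the SPMD program and returns the reduced tensor.
        return xm.all_reduce(xm.REDUCE_SUM, input_)
    # Otherwise keep the existing torch.distributed path (e.g. NCCL on GPUs).
    dist.all_reduce(input_, group=group)
    return input_
```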

@WoosukKwon added the tpu (Related to Google TPUs) and ready (ONLY add when PR is ready to merge/full CI is needed) labels on Jul 26, 2024
@WoosukKwon requested a review from youkaichao on July 26, 2024 01:51

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@WoosukKwon
Collaborator Author

This PR was part of #5871, but it was separated out to get a quick review.

Comment on lines 129 to 132
pynccl_comm: Optional[Any] # PyNccl communicator
ca_comm: Optional[Any] # Custom allreduce communicator
mq_broadcaster: Optional[Any] # shared memory broadcaster
use_xla: bool # Whether to use PyTorch XLA communicator
Member

Does the TPU platform support NCCL? If not, creating these communicators might lead to errors.

Collaborator Author

TPU doesn't support NCCL, but I didn't see any error with the other communicators.

Collaborator Author

The TPU backend uses the gloo backend in addition to xm's distributed runtime. Maybe that's the reason.
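
For context, a rough sketch of that setup (the exact initialization in vLLM may differ): a CPU-side gloo process group handles control-plane communication, while TPU tensor collectives go through xm.

```python
import os

import torch.distributed as dist

# gloo runs on CPU, so it works on TPU hosts where NCCL is unavailable.
dist.init_process_group(
    backend="gloo",
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```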

@WoosukKwon requested a review from youkaichao on July 26, 2024 20:06
@@ -125,6 +129,7 @@ class GroupCoordinator:
pynccl_comm: Optional[Any] # PyNccl communicator
ca_comm: Optional[Any] # Custom allreduce communicator
mq_broadcaster: Optional[Any] # shared memory broadcaster
use_xla: bool # Whether to use PyTorch XLA communicator
Member

use_xxx is an initialization parameter, and we usually hold the communicator inside the group coordinator.

Can you add a tpu_communicator under https://github.com/vllm-project/vllm/tree/main/vllm/distributed/device_communicators ?

One additional benefit is that you can implement the gather-to-all-gather logic there, without an intrusive change to logits_processor.py.
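
A hedged sketch of what such a tpu_communicator could look like (illustrative only; the class name and constructor signature are assumptions, and pjrt here refers to torch_xla._internal.pjrt, matching the initialize_multiprocess call in the diff below):

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm
from torch_xla._internal import pjrt


class TpuCommunicator:
    """Wraps torch_xla collectives so the group coordinator stays device-agnostic."""

    def __init__(self, group: dist.ProcessGroup):
        self.disabled = False
        # Rank and world size can be taken from the group (see the review below).
        local_rank = dist.get_rank(group)
        world_size = dist.get_world_size(group)
        pjrt.initialize_multiprocess(local_rank, world_size)

    def all_reduce(self, x: torch.Tensor) -> torch.Tensor:
        # Sum across all TPU devices.
        return xm.all_reduce(xm.REDUCE_SUM, x)

    def all_gather(self, x: torch.Tensor, dim: int = -1) -> torch.Tensor:
        # Concatenate per-device tensors along `dim`.
        if dim < 0:
            dim += x.dim()
        return xm.all_gather(x, dim=dim)
```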

Collaborator Author

> One additional benefit is that you can implement the gather-to-all-gather logic there, without an intrusive change to logits_processor.py.

This is actually not the case, because the TPU backend explicitly requires all-gather, which means no device's output may be None. If we implemented gather by doing an all-gather and returning None on non-root ranks, XLA would raise an error.
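
A small illustration of that constraint (a hypothetical helper, not the actual logits_processor.py code): on TPU, every rank keeps the full all-gathered tensor rather than returning None on non-root ranks.

```python
import torch
import torch_xla.core.xla_model as xm


def gather_logits_tpu(logits: torch.Tensor) -> torch.Tensor:
    # All-gather the vocab-parallel shards along the last dim; every rank
    # returns a real tensor, so no rank's output is None under SPMD.
    return xm.all_gather(logits, dim=logits.dim() - 1)
```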

return
self.disabled = False

pjrt.initialize_multiprocess(local_rank, world_size)
Member

You can get the rank and world size from the group.
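
For example (a sketch under the assumption that the communicator receives the torch.distributed group in its constructor; pjrt is assumed to be torch_xla._internal.pjrt):

```python
import torch.distributed as dist
from torch_xla._internal import pjrt


def init_xla_runtime(group: dist.ProcessGroup) -> None:
    # Derive rank and world size from the group instead of passing them in.
    rank = dist.get_rank(group)
    world_size = dist.get_world_size(group)
    pjrt.initialize_multiprocess(rank, world_size)
```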

Collaborator Author

Thanks for letting me know! Updated the PR. PTAL.

Member

Do you need to remove the code inside the TPU worker? I don't know whether pjrt and xm support being initialized multiple times.

Collaborator Author

@youkaichao Good point. That's actually updated in #5871. In the current main branch, there's no code initializing XLA's distributed runtime.

Member

@youkaichao left a comment

Thanks for addressing my comments!

@WoosukKwon
Collaborator Author

WoosukKwon commented Jul 26, 2024

@youkaichao Thanks for your review and suggestions on the PR!

@WoosukKwon enabled auto-merge (squash) on July 26, 2024 23:01
@WoosukKwon merged commit d09b94c into main on Jul 27, 2024
71 of 73 checks passed
@WoosukKwon deleted the add-xla-comm branch on July 27, 2024 01:59
cadedaniel pushed a commit to cadedaniel/vllm-public that referenced this pull request Jul 27, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024