[TPU] Support collective communications in XLA devices #6813
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
This PR was part of #5871, but separated out to get a quick review.
vllm/distributed/parallel_state.py (Outdated)

```python
pynccl_comm: Optional[Any]  # PyNccl communicator
ca_comm: Optional[Any]  # Custom allreduce communicator
mq_broadcaster: Optional[Any]  # shared memory broadcaster
use_xla: bool  # Whether to use PyTorch XLA communicator
```
Does the TPU platform support NCCL? If not, creating these communicators might lead to errors.
TPU doesn't support NCCL, but I didn't see any error with the other communicators.
The TPU backend uses the gloo backend in addition to the distributed runtime in `xm`. Maybe that's the reason.
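For context, a gloo process group runs entirely on CPU, so it needs no NCCL on TPU hosts. A minimal, self-contained sketch (single process; the address, port, and tensor here are arbitrary, not from the PR):

```python
import torch
import torch.distributed as dist

# The gloo backend runs on CPU, so it works on hosts without NCCL.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

t = torch.ones(4)
dist.all_reduce(t)  # trivially a no-op with world_size=1
print(t)
dist.destroy_process_group()
```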
vllm/distributed/parallel_state.py (Outdated)

```diff
@@ -125,6 +129,7 @@ class GroupCoordinator:
     pynccl_comm: Optional[Any]  # PyNccl communicator
     ca_comm: Optional[Any]  # Custom allreduce communicator
     mq_broadcaster: Optional[Any]  # shared memory broadcaster
+    use_xla: bool  # Whether to use PyTorch XLA communicator
```
`use_xxx` is an initialization parameter, and we usually hold the communicator itself inside the group coordinator. Can you add a `tpu_communicator` under https://github.com/vllm-project/vllm/tree/main/vllm/distributed/device_communicators ? One additional benefit is that you can implement the `gather` logic via `all-gather` without an intrusive change to `logits_processor.py`.
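A minimal sketch of what such a `tpu_communicator` could look like (the class name, constructor arguments, and structure are assumptions for illustration, not the PR's actual code):

```python
# Hypothetical sketch of a TPU communicator under
# vllm/distributed/device_communicators/ -- names and structure are assumed.
from typing import Optional

import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm


class TpuCommunicator:

    def __init__(self, group: Optional[dist.ProcessGroup] = None):
        # Disable the communicator when no group is given (e.g. non-TPU runs).
        self.disabled = group is None
        self.group = group

    def all_reduce(self, x: torch.Tensor) -> torch.Tensor:
        # xm.all_reduce returns a new tensor summed across all TPU replicas.
        return xm.all_reduce(xm.REDUCE_SUM, x)

    def all_gather(self, x: torch.Tensor, dim: int = -1) -> torch.Tensor:
        # xm.all_gather concatenates per-replica tensors along `dim`.
        if dim < 0:
            dim += x.dim()
        return xm.all_gather(x, dim=dim)
```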
> One additional benefit is that you can implement the `gather` logic via `all-gather` without an intrusive change to `logits_processor.py`.

This is actually not the case, because the TPU backend explicitly requires all-gather, which means each device's output should not be None. If we implemented `gather` by using `all-gather` and returned None for non-root ranks, XLA would raise an error.
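To make the constraint concrete, here is a hedged sketch (function and parameter names are illustrative, not vLLM's actual API): with `torch.distributed`, `gather` returns the result only on the destination rank, whereas under XLA's SPMD model every rank must run the same all-gather and keep a real tensor.

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm


def gather_logits(logits: torch.Tensor, group, rank: int, use_xla: bool):
    if use_xla:
        # SPMD: every rank executes the same program and keeps the full
        # result; returning None on non-root ranks is not allowed.
        dim = logits.dim() - 1
        return xm.all_gather(logits, dim=dim)
    # Conventional path: only the destination rank receives the tensors.
    world_size = dist.get_world_size(group=group)
    out = ([torch.empty_like(logits) for _ in range(world_size)]
           if rank == 0 else None)
    dist.gather(logits, gather_list=out, dst=0, group=group)
    return torch.cat(out, dim=-1) if rank == 0 else None
```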
```python
        return
    self.disabled = False

    pjrt.initialize_multiprocess(local_rank, world_size)
```
You can get the rank and world size from the group.
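For reference, with `torch.distributed` both values can be read from the process group directly (standard API; shown here as a small sketch):

```python
import torch.distributed as dist


def group_rank_and_size(group: dist.ProcessGroup):
    # Rank of the calling process within `group` and the group's size.
    return dist.get_rank(group=group), dist.get_world_size(group=group)
```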
Thanks for letting me know! Updated the PR. PTAL.
Do you need to remove the code inside the TPU worker? I don't know if `pjrt` and `xm` support being initialized multiple times.
@youkaichao Good point. That's actually updated in #5871. In the current main branch, there's no code initializing XLA's distributed runtime.
Thanks for addressing my comments!
@youkaichao Thanks for your review and suggestions on the PR!
This PR adds support for collective communications on XLA devices (TPU). It is implemented simply by falling back to `xm.all_reduce` and `xm.all_gather` for TPU devices. One difference is the gather operation in the logits processor, where `gather` is replaced by `all-gather` to meet the SPMD restriction in XLA.
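A minimal sketch of the fallback described above (assuming `torch_xla`; the wrapper names and the `use_xla` flag are illustrative, not vLLM's exact API):

```python
import torch
import torch.distributed as dist
import torch_xla.core.xla_model as xm


def all_reduce_sum(x: torch.Tensor, group=None, use_xla: bool = False) -> torch.Tensor:
    if use_xla:
        # torch_xla collective: returns a new tensor summed over all replicas.
        return xm.all_reduce(xm.REDUCE_SUM, x)
    dist.all_reduce(x, group=group)  # in-place sum across the process group
    return x


def all_gather_cat(x: torch.Tensor, dim: int, group=None, use_xla: bool = False) -> torch.Tensor:
    if dim < 0:
        dim += x.dim()
    if use_xla:
        # torch_xla collective: concatenates per-replica tensors along `dim`.
        return xm.all_gather(x, dim=dim)
    world_size = dist.get_world_size(group=group)
    out = [torch.empty_like(x) for _ in range(world_size)]
    dist.all_gather(out, x, group=group)
    return torch.cat(out, dim=dim)
```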