FastGen H100 MoE support: Add PyTorch multi-gemm MOE implementation #5586
base: master
Conversation
config=moe_config,
implementation_config=implementation_config)
# check if we are on H100 or above
if torch.cuda.get_device_capability(0)[0] >= 9: #ignore-cuda
We need an extension to the accelerator interface to avoid cuda references.
Also, what is the expected behavior when running on non-cuda devices, where torch.cuda is unavailable?
Good point. How about we extend the accelerator interface and return -1
for non-cuda devices?
I think we need two accelerator API changes:
1. accelerator.name() - can be used in this case to restrict the path to cuda.
2. accelerator.compute_capability() - returns a tuple of int (major, minor) versions of the current accelerator (similar to the cuda approach), which the client can use for control flow.
While the first API is very straightforward, the second seems a bit tricky if we want the accelerator to freely manage versioning.
@delock, @nelyahu, @hipudding I would appreciate your thoughts on whether the proposed capability API provides sufficient freedom for your cases. Thanks!
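A rough, self-contained sketch of how the two proposed calls could fit together. The class and variable names below are made up for illustration, and neither method exists in the accelerator interface today:

import torch

# hypothetical sketch of the proposed additions on a CUDA accelerator
class _SketchCudaAccelerator:
    def name(self):
        return "cuda"

    def compute_capability(self, device_index=0):
        # (major, minor) of the given device, mirroring torch.cuda.get_device_capability
        return torch.cuda.get_device_capability(device_index)

# caller-side control flow, mirroring the H100 check in the diff above
acc = _SketchCudaAccelerator()
if acc.name() == "cuda" and acc.compute_capability()[0] >= 9:
    impl = "pytorch_multi_gemm_moe"   # H100 or newer
else:
    impl = "cutlass_multi_gemm_moe"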
- The existing device_name API with a "device_index" input of None already provides this functionality.
- This API is very specific in that compute capabilities are encoded as major.minor. IMO such switching between optimizations should be configured by the user, not set automatically according to device type. This is inconsistent even across GPUs: switching from H100 to A100, for example, to debug an accuracy issue on a different setup would produce unexpectedly different behavior.
I am not familiar with the implementation of the ConfigBundle flow, so I can't help find an alternative.
Also, CUDAGatedActivation is CUDA specific, so it seems like the existing structure of the code is problematic in terms of accelerator generalization.
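For the first point, a small illustration of guarding a CUDA-only path with the existing API, assuming the usual get_accelerator() accessor and that device_name(None) returns the accelerator family name such as "cuda":

from deepspeed.accelerator import get_accelerator

# restrict an optimization to CUDA using the existing device_name API
if get_accelerator().device_name(None) == "cuda":
    ...  # CUDA-specific implementation selection goes here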
Yes, when the device_index is set to None in the device_name function, it is possible to obtain the device name. For Ascend and CANN (npu_accelerator), the compute_capability is not yet needed, but I am not sure if there will be such a requirement in future versions. However, if this proposal is ready to be implemented, I would be happy to cooperate in modifying the npu_accelerator part.
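If the compute_capability proposal does land, the NPU side could start with a stub along these lines (a sketch only; the method name follows the proposal above, and the (-1, -1) sentinel follows the earlier suggestion of returning -1 for non-cuda devices):

# hypothetical stub inside npu_accelerator
def compute_capability(self, device_index=0):
    # CANN does not expose a CUDA-style (major, minor) today; a sentinel value
    # keeps callers on the default, non-CUDA-specific code path
    return (-1, -1)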
accelerator.compute_capability() seems accelerator specific, so even if the caller gets this tuple, it still needs to know which accelerator it is to make a proper decision.
From the context, the caller needs to decide whether cutlass_multi_gemm_moe or pytorch_multi_gemm_moe should be used. Thus the following interface in the accelerator might help this situation and be extendable in the future: accelerator.get_property("multi_gemm_moe"). This returns either None (the accelerator didn't define this property), or "cutlass_multi_gemm_moe" (the accelerator is CUDA with compute capability < 9), or anything else an accelerator defined and prefers to use. Then the caller could use this property in the context:
name = get_accelerator().get_property("multi_gemm_moe")
if name is None:
    name = "pytorch_multi_gemm_moe"  # default behavior if accelerator didn't define this property
config = ConfigBundle(name=name,
                      config=moe_config,
                      implementation_config=implementation_config)
And in the CUDA accelerator we could have something like this:
def get_property(query):
    ...
    if query == "multi_gemm_moe":
        if torch.cuda.get_device_capability(0)[0] >= 9:
            return None  # or "pytorch_multi_gemm_moe"
        else:
            return "cutlass_multi_gemm_moe"
    ...
    return None
Other accelerators only need to return None for unrecognized properties, so the future maintenance cost could be small.
def get_property(query):
    ...
    return None  # query "multi_gemm_moe" returns from here