Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FastGen H100 MoE support: Add PyTorch multi-gemm MOE implementation #5586

Open
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

HeyangQin
Copy link
Contributor

No description provided.

config=moe_config,
implementation_config=implementation_config)
# check if we are on H100 or above
if torch.cuda.get_device_capability(0)[0] >= 9: #ignore-cuda
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need an extension the accelerator interface to avoid cuda references.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, what is expected behavior when running on non-cuda devices, where torch.cuda is unavailable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. How about we extend the accelerator interface and return -1 for non-cuda devices?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need two accelerator API changes

  1. accelerator.name() that can be used in this case to restrict to cuda.
  2. accelerator.compute_capability() returns tuple of int major, minor versions (similar to cuda approach) of current accelerator which the client can use for control flow.

While the first API is very straightforward, the second seems a bit tricky if we want accelerator to freely manage versioning.

@delock, @nelyahu, @hipudding I will appreciate your thoughts on if the proposed capability API provides sufficient freedom for your cases. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjruwase

  1. the existing device_name API with "device_index" input as None, provides this functionality.
  2. This API is very specific where compute capabilties are encoded as major.minor. IMO such switching between optimizations should be configured by the user. and not set automatically according to device types. this is inconsistent even for GPUs. Switching from H100 to A100 for example to debug an accuracy issue on different setup will provide unexpected different behavior.
    I am not familiar with the implementation of the ConfigBundle flow, so I can't help finding an alternative.
    Also the CUDAGatedActivation is a CUDA specific - So it seems like the existing structure of the code is problematic in terms of accelerator generalization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjruwase

Yes, when the device_index is set to None in the device_name function, it is possible to obtain the device name. For Ascend and CANN (npu_accelerator), the compute_capability is not yet needed, but I am not sure if there will be such a requirement in future versions. However, if this proposal is ready to be implemented, I would be happy to cooperate in modifying the npu_accelerator part.

Copy link
Collaborator

@delock delock Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accelerator.compute_capability() seems an accelerator specific code so even if caller gets this tuple, caller still needs to know which accelerator it is to make proper decision.

From the context, caller needs to decide whether cutlass_multi_gemm_moe should be used or pytorch_multi_gemm_moe should be used. Thus the following interface in accelerator might help this situation and extendable in the future:

accelerator.get_property("multi_gemm_moe"). This returns either None (accelerator didn't define this property) or "cutlass_multi_gemm_moe" (accelerator is CUDA with compute capability < 9) or anything else an accelerator defined and preferred to use. Then caller could use this property in the context:

name = get_accelerator().get_property("multi_gemm_moe")
if name == None:
    name = "pytorch_multi_gemm_moe" # default behavior if accelerator didn't define this property
config = ConfigBundle(name=name,
                      config=moe_config,
                      implementation_config=implementation_config)

And in CUDA accelerator we could have something like this:

def get_property(query):
    ...
    if query == "multi_gemm_moe":
        if torch.cuda.get_device_capability(0)[0] >= 9:
            return None  # or "pytorch_multi_gemm_moe"
        else:
            return "cutlass_multi_gemm_moe"
   ...
   return None

Other accelerators only need to return None for unrecognized properties, so future maintainence time could be small.

def get_property(query):
   ...
   return None   # query "multi_gemm_moe" returns from here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants