FastGen H100 MoE support: Add PyTorch multi-gemm MOE implementation #5586
base: master
Conversation
config=moe_config,
implementation_config=implementation_config)
# check if we are on H100 or above
if torch.cuda.get_device_capability(0)[0] >= 9: #ignore-cuda
We need an extension to the accelerator interface to avoid cuda references.
Also, what is the expected behavior when running on non-cuda devices, where torch.cuda is unavailable?
Good point. How about we extend the accelerator interface and return -1
for non-cuda devices?
I think we need two accelerator API changes:
1. accelerator.name() - can be used in this case to restrict the path to cuda.
2. accelerator.compute_capability() - returns a tuple of int (major, minor) versions of the current accelerator (similar to the cuda approach), which the client can use for control flow.
While the first API is very straightforward, the second seems a bit tricky if we want the accelerator to freely manage versioning.
@delock, @nelyahu, @hipudding I would appreciate your thoughts on whether the proposed capability API provides sufficient freedom for your cases. Thanks!
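A rough, self-contained sketch of how the two proposed calls could fit together. The class and variable names below are made up for illustration, and neither method exists in the accelerator interface today:

import torch

# hypothetical sketch of the proposed additions on a CUDA accelerator
class _SketchCudaAccelerator:
    def name(self):
        return "cuda"

    def compute_capability(self, device_index=0):
        # (major, minor) of the given device, mirroring torch.cuda.get_device_capability
        return torch.cuda.get_device_capability(device_index)

# caller-side control flow, mirroring the H100 check in the diff above
acc = _SketchCudaAccelerator()
if acc.name() == "cuda" and acc.compute_capability()[0] >= 9:
    impl = "pytorch_multi_gemm_moe"   # H100 or newer
else:
    impl = "cutlass_multi_gemm_moe"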
- The existing device_name API with a "device_index" input of None already provides this functionality.
- This API is very specific in that compute capabilities are encoded as major.minor. IMO such switching between optimizations should be configured by the user, not set automatically according to device type. This is inconsistent even across GPUs: switching from H100 to A100, for example, to debug an accuracy issue on a different setup would produce unexpectedly different behavior.
I am not familiar with the implementation of the ConfigBundle flow, so I can't help find an alternative.
Also, CUDAGatedActivation is CUDA specific, so it seems like the existing structure of the code is problematic in terms of accelerator generalization.
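For the first point, a small illustration of guarding a CUDA-only path with the existing API, assuming the usual get_accelerator() accessor and that device_name(None) returns the accelerator family name such as "cuda":

from deepspeed.accelerator import get_accelerator

# restrict an optimization to CUDA using the existing device_name API
if get_accelerator().device_name(None) == "cuda":
    ...  # CUDA-specific implementation selection goes here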
Yes, when the device_index is set to None in the device_name function, it is possible to obtain the device name. For Ascend and CANN (npu_accelerator), the compute_capability is not yet needed, but I am not sure if there will be such a requirement in future versions. However, if this proposal is ready to be implemented, I would be happy to cooperate in modifying the npu_accelerator part.
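If the compute_capability proposal does land, the NPU side could start with a stub along these lines (a sketch only; the method name follows the proposal above, and the (-1, -1) sentinel follows the earlier suggestion of returning -1 for non-cuda devices):

# hypothetical stub inside npu_accelerator
def compute_capability(self, device_index=0):
    # CANN does not expose a CUDA-style (major, minor) today; a sentinel value
    # keeps callers on the default, non-CUDA-specific code path
    return (-1, -1)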
accelerator.compute_capability() seems accelerator specific, so even if the caller gets this tuple, it still needs to know which accelerator it is to make a proper decision.
From the context, the caller needs to decide whether cutlass_multi_gemm_moe or pytorch_multi_gemm_moe should be used. Thus the following interface in the accelerator might help this situation and be extendable in the future: accelerator.get_property("multi_gemm_moe"). This returns either None (the accelerator didn't define this property), or "cutlass_multi_gemm_moe" (the accelerator is CUDA with compute capability < 9), or anything else an accelerator defined and prefers to use. Then the caller could use this property in the context:
name = get_accelerator().get_property("multi_gemm_moe")
if name is None:
    name = "pytorch_multi_gemm_moe"  # default behavior if accelerator didn't define this property
config = ConfigBundle(name=name,
                      config=moe_config,
                      implementation_config=implementation_config)
And in the CUDA accelerator we could have something like this:
def get_property(query):
    ...
    if query == "multi_gemm_moe":
        if torch.cuda.get_device_capability(0)[0] >= 9:
            return None  # or "pytorch_multi_gemm_moe"
        else:
            return "cutlass_multi_gemm_moe"
    ...
    return None
Other accelerators only need to return None for unrecognized properties, so the future maintenance cost could be small.
def get_property(query):
    ...
    return None  # query "multi_gemm_moe" returns from here