[Misc] Add CustomOp Interface to UnquantizedFusedMoEMethod #6289
Conversation
Does this need to be added to the fp8 method as well? Or are we handling quantization separately? https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/fp8.py#L220
@robertgshaw2-neuralmagic We haven't used the …
I think it's okay to leave it for now and make the modifications once we have a need for it.
This PR seems to break Mixtral. Let me check the reason.
What TP is it running at? @WoosukKwon
@comaniac Could you please take a look? The PR removes a few lines of code in the model loader that you marked as FIXME.
That FIXME can be removed safely. Please let me know if the test still fails and I'll take a look.
@comaniac Thanks for the confirmation! It works well.
Currently, UnquantizedFusedMoEMethod directly imports the Triton fused MoE kernel and related CUDA kernels, preventing other hardware backends from supporting MoE models. This PR adds the CustomOp interface to it so that the kernels are imported only for NVIDIA and AMD GPUs.
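For readers unfamiliar with the pattern, here is a minimal sketch of how a CustomOp-style dispatch can defer GPU-only kernel imports into a `forward_cuda` method so that other backends never trigger them. This is an illustrative sketch, not vLLM's actual implementation: the class `UnquantizedFusedMoEMethodSketch`, the `forward_*` method split, and the commented-out `fused_moe` import path are assumptions made for the example.

```python
# Sketch of the CustomOp dispatch pattern (assumed names, not vLLM's code):
# the Triton/CUDA kernel import is deferred into forward_cuda, so merely
# importing this module does not require GPU kernels.
import torch


class CustomOp(torch.nn.Module):
    """Dispatches forward() to a backend-specific implementation."""

    def forward(self, *args, **kwargs):
        if torch.cuda.is_available():
            # ROCm builds of PyTorch also expose torch.cuda, so this branch
            # covers both NVIDIA and AMD GPUs.
            return self.forward_cuda(*args, **kwargs)
        return self.forward_native(*args, **kwargs)

    def forward_native(self, *args, **kwargs):
        raise NotImplementedError

    def forward_cuda(self, *args, **kwargs):
        # Fall back to the native path unless a subclass overrides this.
        return self.forward_native(*args, **kwargs)


class UnquantizedFusedMoEMethodSketch(CustomOp):
    """Hypothetical stand-in for UnquantizedFusedMoEMethod."""

    def forward_cuda(self, hidden_states, w1, w2, router_logits, top_k):
        # The fused MoE kernel would be imported lazily here, inside the
        # GPU-only path, e.g. (hypothetical import path):
        # from vllm.model_executor.layers.fused_moe import fused_moe
        # return fused_moe(hidden_states, w1, w2, router_logits, top_k)
        raise NotImplementedError("GPU-only fused MoE path")

    def forward_native(self, hidden_states, w1, w2, router_logits, top_k):
        raise NotImplementedError("Other backends plug in their own kernel")
```

The design point is simply that the import of the Triton kernel happens inside the GPU-specific method rather than at module import time, which is what lets non-GPU backends load MoE models without pulling in CUDA dependencies.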