[ Kernel ] AWQ Fused MoE #6422
Conversation
👋 Hi! Thank you for contributing to the vLLM project. A full CI run is still required to merge this PR, so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger full CI as well. To run full CI, you can do one of these: add the `ready` label to the PR, or comment `/ready` on the PR. 🚀
Thanks for doing this.
/ready
This looks to be shaping up well! I appreciate the refactoring of fused_moe. It would be clearer if it were separated out first, but I think we can land it within this PR if needed.
logger = init_logger(__name__)

NAIVE_THRESHOLD = 1024
This seems a bit high, and it's worth a comment explaining how it was calibrated (what model, benchmark, and GPU were used).
@robertgshaw2-neuralmagic do we know why this is 1024 specifically?
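For readers skimming the thread, here is a minimal sketch of how a threshold like this is typically used, not the PR's actual code: the helper names are hypothetical, and which path serves small versus large batches is an assumption that the calibration requested above would confirm.

```python
import torch

# Hypothetical crossover point, mirroring the constant quoted above. The value
# should be calibrated on the target model, benchmark, and GPU.
NAIVE_THRESHOLD = 1024


def dispatch_moe(hidden_states: torch.Tensor, naive_moe, fused_moe_kernel):
    """Pick an MoE path based on the number of input tokens.

    `naive_moe` and `fused_moe_kernel` are placeholder callables for the two
    implementations; only the thresholding logic is shown. Assumption: the
    naive path wins for small batches -- flip the comparison if benchmarking
    says otherwise.
    """
    num_tokens = hidden_states.shape[0]
    if num_tokens < NAIVE_THRESHOLD:
        return naive_moe(hidden_states)
    return fused_moe_kernel(hidden_states)
```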
@mgoin I can't resolve it, but I addressed all but one comment.
Latency Benchmarking with two 80 GB A100s:
LGTM! We should add a model test, but considering there isn't a small Mixtral AWQ to test, maybe just including an lm-eval large test would be sufficient.
Note: Splitting this PR into two separate PRs.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has merge conflicts that must be resolved before it can be merged.
SUMMARY:
- Added `AWQMoEMethod`, supporting loading AutoAWQ models.
- Updated `FusedMoE.weight_loader` to enable loading AWQ models, which have transposed weights of shape (input_dim, output_dim) on disk, whereas Fp16 and Fp8 models store (output_dim, input_dim). This required more complex logic for handling indexing in the TP case and the MergedColumn case (a sketch of the layout difference follows this list).
- Reworked `expert_params_mapping`, which was overfit to fp16 and fp8 checkpoints. This required renaming the scale parameters in fp8 to better match the state dicts that we create in AutoFP8, limiting the amount of remapping we need to do in the model files.
- Models now call `fused_topk`/`grouped_topk` and `fused_experts` rather than calling `fused_moe` directly, so that the logic can be reused across `fp16`, `fp8`, and `awq` (see the second sketch below).
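To make the `weight_loader` bullet concrete, here is a minimal sketch of the layout difference only; the function and parameter names are hypothetical and this is not the PR's actual loader logic.

```python
import torch


def shard_output_dim(loaded_weight: torch.Tensor, shard_size: int,
                     tp_rank: int, output_dim: int) -> torch.Tensor:
    """Slice this tensor-parallel rank's shard along the output dimension.

    For the usual fp16/fp8 layout (output_dim, input_dim) pass output_dim=0;
    for AWQ's transposed on-disk layout (input_dim, output_dim) the same
    logical shard must be taken along dim 1 instead.
    """
    return loaded_weight.narrow(output_dim, tp_rank * shard_size, shard_size)


# Example with tensor-parallel size 2, rank 1 (shapes are illustrative):
fp16_w = torch.randn(4096, 1024)              # (output_dim, input_dim)
awq_qw = torch.randint(0, 16, (1024, 512))    # (input_dim, packed output_dim)
print(shard_output_dim(fp16_w, 2048, tp_rank=1, output_dim=0).shape)  # (2048, 1024)
print(shard_output_dim(awq_qw, 256, tp_rank=1, output_dim=1).shape)   # (1024, 256)
```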
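And a sketch of the refactored call pattern from the last bullet, as seen from a model's MoE layer. The wrapper, weight names, and shapes are illustrative, and the `fused_topk`/`fused_experts` argument lists shown here are approximate and may differ between vLLM versions.

```python
import torch
from vllm.model_executor.layers.fused_moe import fused_experts, fused_topk


def moe_forward(hidden_states: torch.Tensor,
                gating_output: torch.Tensor,
                w13: torch.Tensor,   # stacked gate/up projection weights
                w2: torch.Tensor,    # down projection weights
                top_k: int) -> torch.Tensor:
    # Routing is now a standalone step, so fp16, fp8, and awq methods can all
    # share the same top-k selection.
    topk_weights, topk_ids = fused_topk(hidden_states, gating_output,
                                        top_k, renormalize=True)
    # Expert computation is a second standalone step; a quantized method can
    # reuse the routing above and substitute its own expert kernel here.
    return fused_experts(hidden_states, w13, w2, topk_weights, topk_ids,
                         inplace=True)
```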
Latency Benchmarking with 2xA100 80GB:
Follow-up work:
- TechxGenus/DeepSeek-V2-Lite-Chat-AWQ