
[ Kernel ] AWQ Fused MoE #6422

Closed

Conversation

robertgshaw2-redhat
Collaborator

@robertgshaw2-redhat commented Jul 14, 2024

SUMMARY:

  • Picks up #2761 to support AWQ MoE models via a fused kernel, after the refactor I did in [ Misc ] Refactor MoE to isolate Fp8 From Mixtral #5970. The kernels for this PR were developed by @chu-tianxiang
  • Adds AWQMoEMethod, supporting loading AutoAWQ models
  • Refactors FusedMoE.weight_loader to enable loading AWQ models, which have transposed weights of shape (input_dim, output_dim) on disk, whereas fp16 and fp8 models have shape (output_dim, input_dim). This required more complex logic for handling indexing in the TP case and the MergedColumn case
  • Refactors expert_params_mapping, which was overfit to fp16 and fp8 checkpoints. This required renaming the scale parameters in fp8 to better match the state dicts that we create in AutoFP8, limiting the amount of remapping we need to do in the model files
  • Updates layers to use fused_topk/grouped_topk and fused_experts, rather than calling fused_moe directly, so that we can reuse the logic across fp16, fp8, and AWQ (see the sketch after this list)
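
To illustrate the last point, here is a minimal sketch of the shared routing-plus-experts decomposition. The import path and argument names are assumptions modeled on vLLM's fused_moe module at the time of this PR, not an exact copy of the new code:

import torch

# Assumed import path; fused_topk and fused_experts are the helpers this PR
# reuses instead of calling fused_moe directly.
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts, fused_topk


def run_moe(hidden_states: torch.Tensor,
            w13: torch.Tensor,            # merged gate/up projection weights
            w2: torch.Tensor,             # down projection weights
            router_logits: torch.Tensor,
            top_k: int,
            renormalize: bool = True) -> torch.Tensor:
    # Routing (top-k expert selection) is quantization-agnostic, so it can be
    # shared across the fp16, fp8, and AWQ paths.
    topk_weights, topk_ids = fused_topk(hidden_states, router_logits,
                                        top_k, renormalize)
    # Only the expert computation differs per quantization method; an AWQ
    # method would invoke its AWQ fused kernel here instead of fused_experts.
    return fused_experts(hidden_states, w13, w2, topk_weights, topk_ids,
                         inplace=True)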

Latency Benchmarking with 2xA100 80GB:

Mixtral Fused MoE with AWQ:
> python benchmarks/benchmark_latency.py --model casperhansen/mixtral-instruct-awq --tensor-parallel-size 2 --input-len 512 --output-len 128 --max-model-len 1024 --quantization awq --batch-size 1
Avg latency: 1.3650233147976298 seconds
10% percentile latency: 1.3638953405432404 seconds
25% percentile latency: 1.3643284866120666 seconds
50% percentile latency: 1.3648834400810301 seconds
75% percentile latency: 1.3656865148805082 seconds
90% percentile latency: 1.3661401799879969 seconds
99% percentile latency: 1.366987234679982 seconds

Mixtral Fp16:
> python benchmarks/benchmark_latency.py --model mistralai/Mixtral-8x7B-Instruct-v0.1 --tensor-parallel-size 2 --input-len 512 --output-len 128 --max-model-len 1024 --batch-size 1
Avg latency: 1.470517885716011 seconds
10% percentile latency: 1.4687837794423104 seconds
25% percentile latency: 1.4693051476497203 seconds
50% percentile latency: 1.4700921354815364 seconds
75% percentile latency: 1.4719352358952165 seconds
90% percentile latency: 1.4728145461529494 seconds
99% percentile latency: 1.4740593627933414 seconds

Follow-up work:

  • DeepSeek is not supported yet; TechxGenus/DeepSeek-V2-Lite-Chat-AWQ hits an illegal memory access


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only trigger fastcheck CI, which consists of a small and essential subset of tests to quickly catch errors, with the flexibility to run extra individual tests on top (you can do this by unblocking test steps in the Buildkite run).

A full CI run is still required to merge this PR, so once the PR is ready to go, please make sure to run it. If you need all test signals between PR commits, you can trigger a full CI run as well.

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@AlpinDale
Contributor

Thanks for doing this.

@mgoin mgoin marked this pull request as ready for review August 2, 2024 20:58
@dsikka
Contributor

dsikka commented Aug 2, 2024

/ready

@github-actions bot added the ready label Aug 2, 2024
Member

@mgoin left a comment


This looks to be shaping up well! I appreciate the refactoring of fused_moe. It would be clearer if it were separated out first, but I think we can land it within this PR if needed.

vllm/model_executor/layers/fused_moe/fused_moe.py

logger = init_logger(__name__)

NAIVE_THRESHOLD = 1024
Member

This seems a bit high and it is worth commenting how it was calibrated (what model, benchmark, GPU used)

Contributor

@robertgshaw2-neuralmagic do we know why this is 1024 specifically?
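
For readers following along: judging by the name, NAIVE_THRESHOLD presumably gates a dispatch between a naive path for large token counts and the AWQ fused kernel for small ones. The sketch below is a hypothetical illustration of that pattern only; the helper callables are invented for illustration and this is not the PR's actual code.

import torch
from typing import Callable

NAIVE_THRESHOLD = 1024  # value quoted from the diff above; calibration unclear


def dispatch_awq_moe(hidden_states: torch.Tensor,
                     naive_path: Callable[[torch.Tensor], torch.Tensor],
                     awq_fused_path: Callable[[torch.Tensor], torch.Tensor]) -> torch.Tensor:
    # Hypothetical dispatch: with many tokens, a naive path (e.g. dequantize
    # and reuse the fp16 fused MoE kernels) is presumed faster; with few
    # tokens, the AWQ fused kernel runs directly on the packed weights.
    num_tokens = hidden_states.shape[0]
    if num_tokens > NAIVE_THRESHOLD:
        return naive_path(hidden_states)
    return awq_fused_path(hidden_states)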

vllm/model_executor/layers/fused_moe/fused_moe_awq.py
vllm/model_executor/layers/fused_moe/fused_moe_awq.py
vllm/model_executor/layers/quantization/awq.py
vllm/model_executor/layers/fused_moe/layer.py
vllm/model_executor/layers/fused_moe/layer.py
vllm/model_executor/layers/quantization/awq.py
@dsikka
Contributor

dsikka commented Aug 4, 2024

@mgoin I can't resolve the threads, but I've addressed all but one comment

vllm/model_executor/layers/fused_moe/fused_moe_awq.py
vllm/model_executor/layers/fused_moe/layer.py
vllm/model_executor/layers/fused_moe/layer.py
vllm/model_executor/models/deepseek_v2.py
@dsikka
Contributor

dsikka commented Aug 5, 2024

Latency Benchmarking with two 80 GB A100s:

Mixtral Fused MoE with AWQ:
Avg latency: 1.3650233147976298 seconds
10% percentile latency: 1.3638953405432404 seconds
25% percentile latency: 1.3643284866120666 seconds
50% percentile latency: 1.3648834400810301 seconds
75% percentile latency: 1.3656865148805082 seconds
90% percentile latency: 1.3661401799879969 seconds
99% percentile latency: 1.366987234679982 seconds

Mixtral Fp16:
Avg latency: 1.470517885716011 seconds
10% percentile latency: 1.4687837794423104 seconds
25% percentile latency: 1.4693051476497203 seconds
50% percentile latency: 1.4700921354815364 seconds
75% percentile latency: 1.4719352358952165 seconds
90% percentile latency: 1.4728145461529494 seconds
99% percentile latency: 1.4740593627933414 seconds

Member

@mgoin left a comment


LGTM! We should add a model test, but since there isn't a small Mixtral AWQ model to test with, including a large lm-eval test would probably be sufficient.

@dsikka
Contributor

dsikka commented Aug 9, 2024

Note: Splitting this PR into two separate PRs.
PR 1/2: #7334

@mgoin removed the ready label Aug 9, 2024

github-actions bot commented Nov 8, 2024

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale label Nov 8, 2024

mergify bot commented Nov 8, 2024

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @robertgshaw2-neuralmagic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 8, 2024
@mgoin mgoin closed this Nov 8, 2024