Add: Support for Sparse24Bitmask Compressed Models #12097
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from ab892d2 to 02ff821
Add a test file with an 8B 2of4 compressed model for lm_eval_harness in buildkite
Review threads (resolved):
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py
Force-pushed from 02ff821 to c38c20a
Renamed `compressed` to `compressed_weight`; addressed review comments from @dsikka
Signed-off-by: Rahul Tuli <[email protected]>
Force-pushed from 67590ad to 96f376e
compressed=layer.compressed,
bitmask=layer.bitmask,
Should we delete `layer.compressed` and `layer.bitmask` after decompressing them?
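A minimal sketch of the suggested cleanup, assuming the decompression happens in a `process_weights_after_loading`-style hook; `Sparse24Layer` and the inline bitmask scatter are illustrative stand-ins, not the PR's actual code:

```python
import torch

class Sparse24Layer(torch.nn.Module):
    """Toy stand-in for the real layer; holds the packed buffers."""
    def __init__(self, compressed: torch.Tensor, bitmask: torch.Tensor):
        super().__init__()
        self.compressed = compressed  # packed nonzero values (1-D)
        self.bitmask = bitmask        # bool mask, True where a value was kept

def process_weights_after_loading(layer: Sparse24Layer) -> None:
    # Reconstruct the dense weight from the bitmask representation.
    dense = torch.zeros(layer.bitmask.shape, dtype=layer.compressed.dtype)
    dense[layer.bitmask] = layer.compressed
    layer.weight = torch.nn.Parameter(dense, requires_grad=False)

    # The suggestion above: drop the packed buffers once they are unused,
    # so the compressed copies don't stay resident alongside the dense weight.
    del layer.compressed
    del layer.bitmask
```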
This PR adds support for models compressed using `Sparse24BitMaskCompressor` to run on the CUTLASS 2:4 kernels (a toy sketch of the bitmask format follows the model list below).

This diff was manually tested on:
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_int8-BitM
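For context, 2:4 ("2of4") sparsity keeps at most two nonzeros in every contiguous group of four weights, and the bitmask format stores the kept values densely packed next to a boolean mask. A toy roundtrip, purely illustrative (the real compressor lives in the compressed-tensors library, not here):

```python
import torch

def compress_24_bitmask(dense: torch.Tensor):
    """Pack a 2:4-sparse tensor into (values, bitmask)."""
    bitmask = dense != 0
    # 2:4 structure: each group of 4 along the last dim has <= 2 nonzeros.
    assert (bitmask.reshape(-1, 4).sum(dim=1) <= 2).all()
    return dense[bitmask], bitmask

def decompress_24_bitmask(values: torch.Tensor, bitmask: torch.Tensor) -> torch.Tensor:
    """Scatter the packed values back into a dense tensor."""
    dense = torch.zeros(bitmask.shape, dtype=values.dtype)
    dense[bitmask] = values
    return dense

w = torch.tensor([[0.0, 1.5, 0.0, -2.0, 3.0, 0.0, 0.0, 0.5]])
values, mask = compress_24_bitmask(w)
assert torch.equal(decompress_24_bitmask(values, mask), w)
```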
Also added unit tests for the compressed 2:4 FP8, INT8, and sparse-only cases.
Notion Doc: https://www.notion.so/SparseBitMask-24-work-15e863ebf65c80dcbc70e6317d552987
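Once this lands, trying one of the checkpoints above should only require the standard vLLM entry point (the model name is taken from the tested list; no new flags are assumed):

```python
from vllm import LLM, SamplingParams

# One of the 2:4 bitmask-compressed checkpoints exercised in this PR.
llm = LLM(model="nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM")

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```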