Add: Support for Sparse24Bitmask Compressed Models #12097
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
Force-pushed from ab892d2 to 02ff821
Add a test file with an 8B 2of4 compressed model for lm_eval_harness in buildkite
Review threads (resolved):
vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py
vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_24.py
Force-pushed from 02ff821 to c38c20a
Renamed `compressed` to `compressed_weight`; addressed review comments from @dsikka
Signed-off-by: Rahul Tuli <[email protected]>
Force-pushed from 67590ad to 96f376e
compressed=layer.compressed,
bitmask=layer.bitmask,
Should we delete `layer.compressed` and `layer.bitmask` after decompressing them?
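A minimal sketch of the suggested cleanup, assuming the decompression happens in a `process_weights_after_loading`-style hook; `Sparse24Layer` and the inline bitmask scatter are illustrative stand-ins, not the PR's actual code:

```python
import torch

class Sparse24Layer(torch.nn.Module):
    """Toy stand-in for the real layer; holds the packed buffers."""
    def __init__(self, compressed: torch.Tensor, bitmask: torch.Tensor):
        super().__init__()
        self.compressed = compressed  # packed nonzero values (1-D)
        self.bitmask = bitmask        # bool mask, True where a value was kept

def process_weights_after_loading(layer: Sparse24Layer) -> None:
    # Reconstruct the dense weight from the bitmask representation.
    dense = torch.zeros(layer.bitmask.shape, dtype=layer.compressed.dtype)
    dense[layer.bitmask] = layer.compressed
    layer.weight = torch.nn.Parameter(dense, requires_grad=False)

    # The suggestion above: drop the packed buffers once they are unused,
    # so the compressed copies don't stay resident alongside the dense weight.
    del layer.compressed
    del layer.bitmask
```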
This PR adds support for models compressed using `Sparse24BitMaskCompressor` to run on the CUTLASS 2:4 kernels (a toy sketch of the bitmask format follows the model list below).

This diff was manually tested on:
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_int8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_fp8-BitM
nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_int8-BitM
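For context, 2:4 ("2of4") sparsity keeps at most two nonzeros in every contiguous group of four weights, and the bitmask format stores the kept values densely packed next to a boolean mask. A toy roundtrip, purely illustrative (the real compressor lives in the compressed-tensors library, not here):

```python
import torch

def compress_24_bitmask(dense: torch.Tensor):
    """Pack a 2:4-sparse tensor into (values, bitmask)."""
    bitmask = dense != 0
    # 2:4 structure: each group of 4 along the last dim has <= 2 nonzeros.
    assert (bitmask.reshape(-1, 4).sum(dim=1) <= 2).all()
    return dense[bitmask], bitmask

def decompress_24_bitmask(values: torch.Tensor, bitmask: torch.Tensor) -> torch.Tensor:
    """Scatter the packed values back into a dense tensor."""
    dense = torch.zeros(bitmask.shape, dtype=values.dtype)
    dense[bitmask] = values
    return dense

w = torch.tensor([[0.0, 1.5, 0.0, -2.0, 3.0, 0.0, 0.0, 0.5]])
values, mask = compress_24_bitmask(w)
assert torch.equal(decompress_24_bitmask(values, mask), w)
```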
Also added unit tests for the compressed 2:4 FP8, INT8, and sparse-only cases.
Notion Doc: https://www.notion.so/SparseBitMask-24-work-15e863ebf65c80dcbc70e6317d552987
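Once this lands, trying one of the checkpoints above should only require the standard vLLM entry point (the model name is taken from the tested list; no new flags are assumed):

```python
from vllm import LLM, SamplingParams

# One of the 2:4 bitmask-compressed checkpoints exercised in this PR.
llm = LLM(model="nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM")

outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```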