
Add: Support for Sparse24Bitmask Compressed Models #12097

Open

rahul-tuli wants to merge 6 commits into main from rahul-bitmask-additions

Conversation

@rahul-tuli (Contributor) commented on Jan 15, 2025:

This PR adds support for models compressed with the Sparse24BitMaskCompressor, so they can run on the CUTLASS 2:4 kernels.

  • Adds support for the compressed (bitmask-packed) weight case; a decompression sketch follows below
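
For intuition, here is a minimal sketch of what bitmask decompression does, assuming a boolean mask with the same shape as the dense weight. The function name and layout are illustrative; the real Sparse24BitMaskCompressor in compressed-tensors stores a packed bitmask plus shape metadata, so the actual code differs.

```python
import torch

def decompress_bitmask(compressed_values: torch.Tensor,
                       bitmask: torch.Tensor) -> torch.Tensor:
    """Scatter packed non-zero values back into a dense tensor.

    `bitmask` marks the kept positions (two True entries per group of
    four in the 2:4 pattern); `compressed_values` holds those values
    in row-major order.
    """
    dense = torch.zeros(bitmask.shape,
                        dtype=compressed_values.dtype,
                        device=compressed_values.device)
    # Boolean indexing assigns values in row-major order of True positions.
    dense[bitmask] = compressed_values.flatten()
    return dense
```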

This diff was manually tested on:

  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM
  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_int8-BitM
  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_fp8-BitM
  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_tensor_act_int8-BitM
  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_fp8-BitM
  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-tensor_wts_per_tok_dyn_act_int8-BitM
  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_fp8-BitM
  • nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_tensor_act_int8-BitM

Also added unit tests for the compressed 2:4 fp8, int8, and sparse-only cases!
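
For reference, loading one of the checkpoints above follows the usual vLLM offline-inference flow; the prompt and sampling settings below are illustrative.

```python
from vllm import LLM, SamplingParams

# Any of the -BitM checkpoints listed above should load the same way.
llm = LLM(model="nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM")
outputs = llm.generate(["The capital of France is"],
                       SamplingParams(temperature=0.0, max_tokens=16))
print(outputs[0].outputs[0].text)
```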
Notion Doc: https://www.notion.so/SparseBitMask-24-work-15e863ebf65c80dcbc70e6317d552987


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small, essential subset of tests to catch errors quickly. You can run the remaining CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

  • Add the ready label to the PR
  • Enable auto-merge.

🚀

@rahul-tuli force-pushed the rahul-bitmask-additions branch from ab892d2 to 02ff821 on January 15, 2025 at 20:59
@rahul-tuli (Contributor, Author) commented:

Add a test file with an 8B 2of4 compressed model for lm_eval_harness in Buildkite.
Add test cases for (a sketch follows the list):

  • Sparse only
  • fp8 + sparse, dynamic per-token
  • fp8 scheme
  • int8 dynamic
  • int8 scheme
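
A hedged sketch of how such parametrized tests might be laid out; the PR's actual test file likely scores the models with the lm-eval harness against accuracy thresholds, and the fixtures and checkpoint selection here are illustrative.

```python
import pytest
from vllm import LLM

# Checkpoint ids taken from the manually-tested list in the PR description.
MODELS = [
    "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM",
    "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_int8-BitM",
]

@pytest.mark.parametrize("model_id", MODELS)
def test_sparse24_bitmask_generates(model_id: str):
    # A smoke test: the compressed checkpoint loads and produces text.
    llm = LLM(model=model_id)
    outputs = llm.generate(["Q: What is 2 + 2? A:"])
    assert outputs and outputs[0].outputs[0].text
```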

@rahul-tuli force-pushed the rahul-bitmask-additions branch from 02ff821 to c38c20a on January 22, 2025 at 18:23
@mergify (bot) added the ci/build label on Jan 22, 2025
Renamed `compressed` to `compressed_weight`
Signed-off-by: Rahul Tuli <[email protected]>

Address review comments from @dsikka
Signed-off-by: Rahul Tuli <[email protected]>
@rahul-tuli force-pushed the rahul-bitmask-additions branch from 67590ad to 96f376e on January 22, 2025 at 21:44
@rahul-tuli marked this pull request as ready for review on January 22, 2025 at 21:46
Comment on lines +161 to +162
compressed=layer.compressed,
bitmask=layer.bitmask,
A member commented:

Should we delete layer.compressed and layer.bitmask after decompressing them?
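
For illustration, a minimal sketch of the cleanup being asked about, assuming decompression happens in a post-load hook and that the attribute names match the diff above; the hook name and weight handling are hypothetical, not the PR's actual code.

```python
import torch

def process_weights_after_loading(layer: torch.nn.Module) -> None:
    # Scatter the packed values back into a dense weight
    # (see the decompression sketch in the PR description).
    dense = torch.zeros(layer.bitmask.shape,
                        dtype=layer.compressed.dtype,
                        device=layer.compressed.device)
    dense[layer.bitmask] = layer.compressed.flatten()
    layer.weight = torch.nn.Parameter(dense, requires_grad=False)
    # Delete the packed tensors so their memory can be reclaimed,
    # which is the cleanup this review comment suggests.
    del layer.compressed
    del layer.bitmask
```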
