DeepSeek-Coder-V2-Lite-Instruct not working when quantized to FP8 using AutoFP8 #29
Comments
Hey @Syst3m1cAn0maly, we don't support quantization in vLLM for non-Mixtral MoEs yet. We are currently doing a refactor to support Qwen2 and DeepSeek-V2: vllm-project/vllm#6088 |
Thank you for the efforts. Looking forward to FP8 support for DSv2❤️ |
Working on this today |
This should do it for you: |
Thanks a lot. |
You need to skip the routing gate:

    # Define quantization config with static activation scales
    quantize_config = BaseQuantizeConfig(
        quant_method="fp8",
        activation_scheme="static",
        # skip the lm head and expert gate
        ignore_patterns=["re:.*lm_head", "re:.*gate.weight"],
    )

The other thing I'm not sure about is the following layers:

    self_attn.kv_a_proj_with_mqa
    self_attn.kv_b_proj

I'm working on seeing how sensitive they are now. |
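To see which names the two ignore patterns above would select, here is a minimal standalone sketch. It assumes that a pattern prefixed with "re:" is treated as a regular expression searched against the name and that any other pattern must match exactly. The helper is_ignored and the example names are hypothetical and only illustrate the pattern semantics; this is not AutoFP8's actual code.

```python
import re

REGEX_PREFIX = "re:"

def is_ignored(name, ignore_patterns):
    """Return True if `name` matches any ignore pattern.

    Assumption: "re:"-prefixed patterns are regexes searched against the
    name; all other patterns require an exact match.
    """
    for pattern in ignore_patterns:
        if pattern.startswith(REGEX_PREFIX):
            if re.search(pattern[len(REGEX_PREFIX):], name):
                return True
        elif pattern == name:
            return True
    return False

ignore_patterns = ["re:.*lm_head", "re:.*gate.weight"]

# Hypothetical names, chosen only to exercise the patterns.
names = [
    "lm_head",
    "model.layers.1.mlp.gate.weight",
    "model.layers.1.mlp.experts.0.gate_proj.weight",
    "model.layers.1.self_attn.kv_b_proj",
]

skipped = [n for n in names if is_ignored(n, ignore_patterns)]
print(skipped)
```

Under these assumptions the routing gate is skipped while the experts' gate_proj is not, because "gate" must be followed by exactly one character and then "weight". The "." in the pattern is unescaped, so it matches any single character, not only a literal dot.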
Thanks, I will try with these settings. |
FYI - config above is good. But needed one more tweak on vllm side. |
@robertgshaw2-neuralmagic thanks a lot for the work |
I tested today and it now works as expected, thanks ! |
Hi!
I quantized DeepSeek-Coder-V2-Lite-Instruct to FP8 using AutoFP8, but when I try to run it with vLLM I get the following error:
RuntimeError: "cat_cuda" not implemented for 'Float8_e4m3fn'
I ran the quantization using this script:
and I got the following output:
What can I do to quantize this kind of model correctly?