[Performance]: decoding speed on long context #11286
Comments
Most of the time the GPU hits a memory-bandwidth bottleneck rather than a compute bottleneck. As the prompt length increases, the size of the KV cache that must be read can even exceed the size of the model. Going further: inference latency increases linearly with the context size, primarily because of memory access time. You can even fit a linear function, with x = the size of the KV cache that must be read and y = the time required for one step. For more, see flashdecoding. |
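A minimal sketch of the linear fit described above, using ordinary least squares in plain Python. The measurements are made up for illustration and are not from this issue; `fit_line`, `kv_gb`, and `step_ms` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements (NOT from this issue):
# x = GB of KV cache read per decoding step, y = milliseconds per step.
kv_gb = [0.5, 1.0, 2.0, 4.0]
step_ms = [21.0, 21.6, 22.8, 25.2]

slope, intercept = fit_line(kv_gb, step_ms)
print(f"step_ms ~= {slope:.2f} * kv_gb + {intercept:.2f}")
```

Under this model the intercept approximates the fixed per-step cost (mostly reading the weights), and the inverse of the slope approximates the effective bandwidth seen by the KV cache reads.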
GPU L40, Qwen2.5-32B-GPTQ-Int4 |
By default
flashinfer used to be slightly faster than flash_attn, but I'm not sure that's still the case. You can try vllm + flashinfer to see if that improves performance. Looking forward to your benchmark |
I can't find the configuration for flashinfer. How do I use flashinfer in vllm? |
I'm not sure vllm supports the latest flashinfer release, v0.2.0 (#11314). It is safer to use flashinfer v0.1.6.
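A hedged sketch of one way to select the backend: as far as I know, vLLM versions from this period pick the attention backend from the `VLLM_ATTENTION_BACKEND` environment variable, which has to be set before vLLM starts. The helper name `with_flashinfer_backend` is hypothetical, and the sketch assumes flashinfer is already installed.

```python
import os


def with_flashinfer_backend(env=None):
    """Return a copy of `env` that asks vLLM to use the FlashInfer backend.

    Assumption: vLLM reads VLLM_ATTENTION_BACKEND at startup, so the variable
    must be in the environment of the process that launches vLLM.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
    return new_env


# Example: build the environment for a server process on GPU 0.
env = with_flashinfer_backend({"CUDA_VISIBLE_DEVICES": "0"})
print(env["VLLM_ATTENTION_BACKEND"])
```

The returned dictionary can be passed as `env=` to `subprocess.Popen` when starting the server, or the variable can be exported in the shell instead.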
|
With flashinfer it still takes 12 seconds |
interesting |
Maybe it is a similar issue to #11317 (comment) |
@jeejeelee I tried increasing max-seq-len-to-capture, but it didn't help |
vllm v0 uses the default scheduler; chunked_prefill performs better for long inputs. Please try the configuration below: enable_chunked_prefill = True |
@noooop |
Could you please provide more details, such as the model, the running script, etc.? I can try to reproduce your issue if I have bandwidth this weekend. |
I only use VS Code + the REST Client plugin to test v1/chat/completions. The prompt is the long content of a document, and I ask the LLM to summarize it, with a maximum output length of 500. Because there were some issues with sglang 0.4.0, I just tried sglang 0.3.2 again, and it takes 6 s |
Have you tried “gptq” kernel? In my case, “gptq” kernel is faster than “marlin” kernel. I'm not sure whether it's a bug or not. My GPU is 3090 |
|
@jeejeelee @noooop I just tried gptq again, and it's basically the same as gptq_marlin |
setting
Offline inference
prefills
decoding
conclusion
Situations not tested
vllm and sglang use almost the same MLP and attention implementations, and this code has been optimized for years. I'm not very familiar with the web server side and need other experts to help. |
vllm output 283 tokens and took 117804 ms. How could that happen? @Flynn-Zh https://github.com/noooop/snippet/blob/main/benchmarks/test_gptq/main.py |
result.txt |
I modified main.py and ran the offline test again; the result is: |
I'm very sorry, I added the unsupported parameter enforce_eager to sgl.Engine but didn't test it.
Summarize
hardware: L40 × 1
Offline inference
prefills
using chunked prefill
vllm default scheduler
decoding
conclusion
L40: 864 GB/s. So 4090 prefill is slower than L40, but decoding is almost the same. Very reasonable.
|
@jeejeelee
|
With reference to "LLM inference speed of light": the Qwen2.5-32B-GPTQ-Int4 weights are 18 GB, and 18 GB / 864 GB/s ≈ 20 ms per decoding step. Prefill 2.33 s + decoding 20 ms × (512 − 16) ≈ 12.25 s. This does not take the KV cache into account.
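The same back-of-the-envelope estimate as a short sketch. It only redoes the arithmetic with the numbers quoted in this comment (18 GB of weights, 864 GB/s, 2.33 s prefill, 512 − 16 decoding steps); the function names are hypothetical, and the KV cache is ignored just as in the comment.

```python
def decode_step_seconds(weight_gb, bandwidth_gb_per_s):
    """Lower bound for one decoding step: every weight byte is read once."""
    return weight_gb / bandwidth_gb_per_s


def total_latency_seconds(prefill_s, step_s, decode_steps):
    """Prefill time plus a fixed cost per decoding step (no KV cache term)."""
    return prefill_s + step_s * decode_steps


WEIGHT_GB = 18.0         # Qwen2.5-32B-GPTQ-Int4 weights, as quoted above
BANDWIDTH_GB_S = 864.0   # L40 memory bandwidth, as quoted above
PREFILL_S = 2.33         # measured prefill time, as quoted above
DECODE_STEPS = 512 - 16  # decoding steps, as quoted above

step_exact = decode_step_seconds(WEIGHT_GB, BANDWIDTH_GB_S)
total_rounded = total_latency_seconds(PREFILL_S, 0.020, DECODE_STEPS)
total_exact = total_latency_seconds(PREFILL_S, step_exact, DECODE_STEPS)

print(f"step: {step_exact * 1e3:.2f} ms")
print(f"total with the rounded 20 ms step: {total_rounded:.2f} s")
print(f"total with the unrounded step:     {total_exact:.2f} s")
```

With the step rounded to 20 ms the total is about 12.25 s; keeping the unrounded 20.83 ms gives about 12.66 s. Either way the estimate is close to the 12 seconds reported earlier in the thread.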
|
What black technology does sglang have? |
I've been thinking the same thing.
|
We can run a profiler to investigate it |
@noooop My understanding is that the calculation method in this article applies to MHA models, but Qwen2.5 is a GQA model. Is my understanding correct? |
GQA only affects the KV cache latency term. This estimate does not consider the KV cache at all. |
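To make the GQA point concrete, here is a rough sketch of the KV cache term that the estimate leaves out. The model shape (64 layers, 40 query heads, 8 KV heads, head dimension 128) and the fp16 KV cache are assumptions about Qwen2.5-32B that are not stated in this thread; the function name is hypothetical.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache stored per token: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed model shape (not given in this thread).
LAYERS, Q_HEADS, KV_HEADS, HEAD_DIM = 64, 40, 8, 128
BANDWIDTH_BYTES_S = 864e9  # L40 memory bandwidth quoted in this thread

gqa_bytes = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM)
# Hypothetical MHA variant: every query head has its own K/V head.
mha_bytes = kv_cache_bytes_per_token(LAYERS, Q_HEADS, HEAD_DIM)

context_tokens = 8000  # input_len used for profiling in this thread
gqa_read_ms = gqa_bytes * context_tokens / BANDWIDTH_BYTES_S * 1e3
mha_read_ms = mha_bytes * context_tokens / BANDWIDTH_BYTES_S * 1e3

print(f"GQA: {gqa_bytes} bytes/token, about {gqa_read_ms:.1f} ms per step at 8000 tokens")
print(f"MHA: {mha_bytes} bytes/token, about {mha_read_ms:.1f} ms per step at 8000 tokens")
```

Under these assumptions GQA shrinks the KV cache read by the ratio of query heads to KV heads (5× here), while the weight-read term in the estimate stays the same.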
Let's try an NVIDIA Nsight profile, input_len = 8000
for vllm
overall
prefill * 8 & decoding * 16. Straightforward.
Enlarge the decoding part
One decoding step takes 25 ms, very reasonable.
Summarize for vllm
Kernels
for sglang
overall
Can only roughly see prefill * 8 & decoding * 16.
Enlarge the decoding part
One decoding step takes 24 ms, very reasonable.
Summarize for vllm
Kernels
Cannot find CUDA kernel information.
How to use NVIDIA Nsight profile
https://developer.nvidia.com/nsight-systems/get-started
Download for Linux on x86_64
Nsight Systems 2024.7.1 Full Version
Download .run Installer
apt install nsight-systems does not work for me.
Looking forward to your profile |
I'm not familiar with sglang. Is there a better way to profile sglang? Or do I need to add parameters to nsys profile? |
Keep an eye on it. I'm curious why it's happening. Does it only occur with Qwen2.5-32B-Instruct-GPTQ-Int4, or does it affect other models too? |
Yes, it's very strange. I think the 4090 results are obviously reasonable, and the L40 results are very unreasonable. I'm trying to determine how it was triggered. It feels like sglang, at 10.41 s, is not reading any KV cache in the decoding stage. @Flynn-Zh |
I usually use torch.profiler |
|
for sglang
overall
for vllm
overall
vllm Kernels
conclusion
So I think L40 is slower because vllm does not use Marlin? But why?
|
Is it caused by setting the quantization parameter?
I thought it was the same as quantization = None, but maybe it's not. Forcing MarlinLinearKernel seems really fast!
args.environs = {
But why did I add quantization = "gptq_marlin" and also use the Marlin kernel? Please try:
|
for 4090
|
Summarize for vllm
Kernels
conclusion
let's double check
Wait a minute, there may also be a problem with the Marlin implementation of sglang.
GPU Memory Bandwidth | 864GB/s
|
I'll wait to do further testing until #11493 is merged. |
Proposal to improve performance
In our experiments, we found that the decoding speed of vLLM drops dramatically as the prompt gets longer.
We fixed the batch size at 90. The decoding speed is 5364 tokens/s when the prompt length is under 100, 5500 tokens/s for 100 to 200, 782 tokens/s for 4000 to 8000, and 273 tokens/s above 8000.
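As a rough sketch, the throughput figures above can be converted into time per decoding step. This assumes all 90 sequences are active in every step and each step emits one token per sequence, which the report does not state; the bucket labels and function name are hypothetical.

```python
BATCH_SIZE = 90

# Decoding throughput in tokens/s per prompt-length bucket, from the report above.
throughput_tokens_s = {
    "<100": 5364,
    "100-200": 5500,
    "4000-8000": 782,
    ">8000": 273,
}


def step_latency_ms(batch_size, tokens_per_s):
    """Time for one decoding step if each step emits `batch_size` tokens."""
    return batch_size / tokens_per_s * 1e3


latency_ms = {
    bucket: step_latency_ms(BATCH_SIZE, tps)
    for bucket, tps in throughput_tokens_s.items()
}

for bucket, ms in latency_ms.items():
    print(f"prompt length {bucket:>9}: {ms:6.1f} ms per step")
```

Read this way, a step goes from roughly 17 ms for short prompts to roughly 330 ms above 8000, which is the kind of growth a per-step cost dominated by KV cache reads would produce at this batch size.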
Report of performance regression
No response
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
Before submitting a new issue...