From ab63134a037dd92e62265c77827f66360330d80b Mon Sep 17 00:00:00 2001
From: SachinVarghese
Date: Thu, 2 Jan 2025 15:18:50 -0500
Subject: [PATCH] Update default max_num_batch_tokens for chunked prefill

---
 docs/source/usage/performance.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/docs/source/usage/performance.md b/docs/source/usage/performance.md
index f028e28627a9f..2cd3801bfc82d 100644
--- a/docs/source/usage/performance.md
+++ b/docs/source/usage/performance.md
@@ -32,8 +32,8 @@ You can enable the feature by specifying `--enable-chunked-prefill` in the comma
 ```python
 llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True)
 # Set max_num_batched_tokens to tune performance.
-# NOTE: 512 is the default max_num_batched_tokens for chunked prefill.
-# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=512)
+# NOTE: 2048 is the default max_num_batched_tokens for chunked prefill.
+# llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_chunked_prefill=True, max_num_batched_tokens=2048)
 ```
 
 By default, vLLM scheduler prioritizes prefills and doesn't batch prefill and decode to the same batch.
@@ -49,13 +49,12 @@ This policy has two benefits:
 - It improves ITL and generation decode because decode requests are prioritized.
 - It helps achieve better GPU utilization by locating compute-bound (prefill) and memory-bound (decode) requests to the same batch.
 
-You can tune the performance by changing `max_num_batched_tokens`.
-By default, it is set to 512, which has the best ITL on A100 in the initial benchmark (llama 70B and mixtral 8x22B).
+You can tune the performance by changing `max_num_batched_tokens`. By default, it is set to 2048.
 Smaller `max_num_batched_tokens` achieves better ITL because there are fewer prefills interrupting decodes.
 Higher `max_num_batched_tokens` achieves better TTFT as you can put more prefill to the batch.
 
 - If `max_num_batched_tokens` is the same as `max_model_len`, that's almost the equivalent to the default scheduling policy (except that it still prioritizes decodes).
-- Note that the default value (512) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
+- Note that the default value (2048) of `max_num_batched_tokens` is optimized for ITL, and it may have lower throughput than the default scheduler.
 
 We recommend you set `max_num_batched_tokens > 2048` for throughput.
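
The ITL/TTFT trade-off described in the patched text can be illustrated with a toy model. The sketch below is not vLLM's scheduler: `chunk_prefill` is a hypothetical helper, and it assumes each running decode request takes exactly one token of the per-step budget before the remainder is given to a single pending prompt.

```python
from typing import List


def chunk_prefill(prompt_len: int, num_decodes: int,
                  max_num_batched_tokens: int) -> List[int]:
    """Return the prefill chunk size scheduled in each step (toy model).

    Assumption: decodes are served first and cost one token each, so the
    prefill may only use the budget that is left over in every step.
    """
    prefill_budget = max_num_batched_tokens - num_decodes
    if prefill_budget <= 0:
        raise ValueError("token budget must exceed the number of decode requests")

    chunks: List[int] = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, prefill_budget)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


if __name__ == "__main__":
    # A 4096-token prompt arriving while 32 requests are decoding.
    for budget in (512, 2048, 8192):
        chunks = chunk_prefill(4096, 32, budget)
        print(f"budget={budget}: {len(chunks)} steps, largest chunk={max(chunks)}")
```

Under these assumptions, a budget of 512 spreads the prompt over 9 steps with at most 480 prefill tokens sharing a step with the decodes, 2048 needs 3 steps, and 8192 finishes the prefill in a single step. Fewer steps means the first token arrives sooner (better TTFT), while smaller chunks mean each decode step carries less prefill work (better ITL).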