[Frontend][V1] Online serving performance improvements #12287
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
These help in particular with TTFT and ITL variance. Overall throughput doesn't change much.

- Break up output processing (detokenization) to avoid blocking the event loop for too long
- Freeze the heap after startup to reduce GC overhead/pauses
- Optimize a couple of CPU hotspots seen during profiling

Signed-off-by: Nick Hill <[email protected]>
Force-pushed from cfc5705 to 55dd119.
@@ -42,23 +42,31 @@ class OpenAIBaseModel(BaseModel):
    # OpenAI API does allow extra fields
    model_config = ConfigDict(extra="allow")

    # Cache class field names
    field_names: ClassVar[Optional[Set[str]]] = None
There was noticeable overhead creating this set every time one of these objects is instantiated.
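As a rough sketch of the pattern (the validator body here is illustrative, not the exact vLLM code): the set of known field names is built lazily, once per class, and reused for every subsequent instantiation instead of being recreated each time a request object is constructed.

```python
from typing import Any, ClassVar, Optional, Set

from pydantic import BaseModel, ConfigDict, model_validator


class OpenAIBaseModel(BaseModel):
    # OpenAI API does allow extra fields
    model_config = ConfigDict(extra="allow")

    # Cache class field names (built lazily, once per subclass)
    field_names: ClassVar[Optional[Set[str]]] = None

    @model_validator(mode="wrap")
    @classmethod
    def __log_extra_fields__(cls, data: Any, handler: Any) -> Any:
        result = handler(data)
        if isinstance(data, dict):
            field_names = cls.field_names
            if field_names is None:
                # Previously this set was rebuilt for every request object;
                # now it is computed once and cached on the class.
                field_names = set(cls.model_fields.keys())
                cls.field_names = field_names
            if extra := (data.keys() - field_names):
                # The real code would log a warning about ignored fields.
                print(f"Fields not recognized by {cls.__name__}: {extra}")
        return result
```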
def output_token_ids(self) -> ConstantList[int]:
    # Prevent directly appending to the output_token_ids since
    # all_token_ids should also be updated simultaneously.
    return ConstantList(self._output_token_ids)
Avoid constructing these objects every time the properties are accessed.
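A minimal self-contained sketch of the change being described (attribute names and the stripped-down ConstantList are illustrative): build the read-only view once in __init__ and return the cached wrapper from the property, rather than allocating a new ConstantList on every access.

```python
from typing import List


class ConstantList:
    """Minimal stand-in for vLLM's read-only list wrapper."""

    def __init__(self, items: List[int]) -> None:
        self._items = items

    def __getitem__(self, idx: int) -> int:
        return self._items[idx]

    def __len__(self) -> int:
        return len(self._items)


class Request:
    def __init__(self, prompt_token_ids: List[int]) -> None:
        self._output_token_ids: List[int] = []
        self._all_token_ids: List[int] = list(prompt_token_ids)
        # Wrap once; the views track appends to the underlying lists,
        # so there is no need to rebuild them on every property access.
        self._output_token_ids_view = ConstantList(self._output_token_ids)
        self._all_token_ids_view = ConstantList(self._all_token_ids)

    @property
    def output_token_ids(self) -> ConstantList:
        # Callers get a read-only view (no direct appends), and the cached
        # wrapper avoids constructing a new object each time.
        return self._output_token_ids_view

    def append_output_token_ids(self, token_id: int) -> None:
        # Both lists are updated together, as the original comment requires.
        self._output_token_ids.append(token_id)
        self._all_token_ids.append(token_id)
```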
Nice catch!
I actually thought properties were cached after the first call, nice call
> I actually thought properties were cached after the first call, nice call

That would involve the use of cached_property.
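For reference, a plain @property runs its getter on every access, while functools.cached_property stores the result on the instance after the first call. A toy illustration (not vLLM code):

```python
from functools import cached_property


class Example:
    def __init__(self) -> None:
        self.plain_calls = 0
        self.cached_calls = 0

    @property
    def plain(self) -> int:
        # Recomputed on every attribute access.
        self.plain_calls += 1
        return self.plain_calls

    @cached_property
    def cached(self) -> int:
        # Computed once; the result is stored in the instance __dict__.
        self.cached_calls += 1
        return self.cached_calls


e = Example()
assert (e.plain, e.plain) == (1, 2)    # getter ran twice
assert (e.cached, e.cached) == (1, 1)  # getter ran once
```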
Signed-off-by: Nick Hill <[email protected]>
Wow, the impact on P99 ITL is crazy.
# Mark the startup heap as static so that it's ignored by GC.
# Reduces pause times of oldest generation collections.
gc.collect()
gc.freeze()
Do we need to call unfreeze at some point?
No, this is mostly static stuff that will be around for the lifetime of the process anyhow.
https://www.rippling.com/blog/the-garbage-collector-fights-back
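For context, the pattern in the diff above is: let startup allocation settle, run one full collection, then move everything that survived into the permanent generation so later collections skip it. A standalone illustration (the comments reflect the discussion above):

```python
import gc

# ... startup work happens here: imports, engine construction, warmup, etc.

# Mark the startup heap as static so that it's ignored by GC.
gc.collect()   # clean up whatever garbage startup produced
gc.freeze()    # move surviving objects to the permanent generation

print(f"Objects excluded from future collections: {gc.get_freeze_count()}")

# gc.unfreeze() exists, but as noted above it isn't needed here: the frozen
# objects are long-lived and are reclaimed when the process exits.
```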
Combining with #12298 and increasing the max output processing chunk size to 256 gets higher throughput at the cost of slightly more latency variance. Since the benchmark I've been running uses 400 concurrent requests, the 256 chunk size essentially just means those ~400 outputs will be split into two chunks. If I disable the chunking completely, the throughput increases to 80 req/sec (with the coalescing), but the inter-response latencies become larger and more uneven.
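A rough sketch of the chunking idea being discussed (MAX_CHUNK_SIZE, process_outputs, and handle_output are illustrative names, not the actual vLLM API): process at most a fixed number of request outputs per iteration and yield to the event loop between chunks, trading a little throughput for smoother inter-response latencies.

```python
import asyncio
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

MAX_CHUNK_SIZE = 256  # the chunk size value discussed above


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split items into consecutive chunks of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def handle_output(output: dict) -> None:
    """Placeholder for per-request detokenization / response dispatch."""
    ...


async def process_outputs(outputs: List[dict]) -> None:
    """Post-process a batch of outputs without hogging the event loop."""
    for chunk in chunked(outputs, MAX_CHUNK_SIZE):
        for output in chunk:
            handle_output(output)
        # Yield to the event loop so pending responses/streams can be
        # flushed between chunks instead of waiting for the whole batch.
        await asyncio.sleep(0)
```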
It would probably be good to also make
Signed-off-by: Nick Hill <[email protected]>
LGTM!
I ran an lm-eval test with gsm8k as a smoke test and got the same result as v0
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 8000 --max-num-batched-tokens 8192 --no-enable-prefix-caching
lm_eval --model local-completions --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,tokenized_requests=False --tasks gsm8k --num_fewshot 5
local-completions (model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7718|± |0.0116|
| | |strict-match | 5|exact_match|↑ |0.6983|± |0.0126|
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#   vllm/envs.py
These help in particular with TTFT, ITL variance, and overall throughput.
Benchmark on A100: before vs. after.