[Frontend][V1] Online serving performance improvements #12287
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
These help in particular with TTFT and ITL variance. Overall throughput doesn't change much.

- Break up output processing (detokenization) to avoid blocking the event loop for too long
- Freeze the heap after startup to reduce GC overhead/pauses
- Optimize a couple of CPU hotspots seen during profiling

Signed-off-by: Nick Hill <[email protected]>
Force-pushed from cfc5705 to 55dd119.
@@ -42,23 +42,31 @@ class OpenAIBaseModel(BaseModel):
    # OpenAI API does allow extra fields
    model_config = ConfigDict(extra="allow")

    # Cache class field names
    field_names: ClassVar[Optional[Set[str]]] = None
There was noticeable overhead creating this set every time one of these objects is instantiated.
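As a rough sketch of the pattern (the validator body here is illustrative, not the exact vLLM code): the set of known field names is built lazily, once per class, and reused for every subsequent instantiation instead of being recreated each time a request object is constructed.

```python
from typing import Any, ClassVar, Optional, Set

from pydantic import BaseModel, ConfigDict, model_validator


class OpenAIBaseModel(BaseModel):
    # OpenAI API does allow extra fields
    model_config = ConfigDict(extra="allow")

    # Cache class field names (built lazily, once per subclass)
    field_names: ClassVar[Optional[Set[str]]] = None

    @model_validator(mode="wrap")
    @classmethod
    def __log_extra_fields__(cls, data: Any, handler: Any) -> Any:
        result = handler(data)
        if isinstance(data, dict):
            field_names = cls.field_names
            if field_names is None:
                # Previously this set was rebuilt for every request object;
                # now it is computed once and cached on the class.
                field_names = set(cls.model_fields.keys())
                cls.field_names = field_names
            if extra := (data.keys() - field_names):
                # The real code would log a warning about ignored fields.
                print(f"Fields not recognized by {cls.__name__}: {extra}")
        return result
```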
def output_token_ids(self) -> ConstantList[int]:
    # Prevent directly appending to the output_token_ids since
    # all_token_ids should also be updated simultaneously.
    return ConstantList(self._output_token_ids)
Avoid constructing these objects every time the properties are accessed.
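A minimal self-contained sketch of the change being described (attribute names and the stripped-down ConstantList are illustrative): build the read-only view once in __init__ and return the cached wrapper from the property, rather than allocating a new ConstantList on every access.

```python
from typing import List


class ConstantList:
    """Minimal stand-in for vLLM's read-only list wrapper."""

    def __init__(self, items: List[int]) -> None:
        self._items = items

    def __getitem__(self, idx: int) -> int:
        return self._items[idx]

    def __len__(self) -> int:
        return len(self._items)


class Request:
    def __init__(self, prompt_token_ids: List[int]) -> None:
        self._output_token_ids: List[int] = []
        self._all_token_ids: List[int] = list(prompt_token_ids)
        # Wrap once; the views track appends to the underlying lists,
        # so there is no need to rebuild them on every property access.
        self._output_token_ids_view = ConstantList(self._output_token_ids)
        self._all_token_ids_view = ConstantList(self._all_token_ids)

    @property
    def output_token_ids(self) -> ConstantList:
        # Callers get a read-only view (no direct appends), and the cached
        # wrapper avoids constructing a new object each time.
        return self._output_token_ids_view

    def append_output_token_ids(self, token_id: int) -> None:
        # Both lists are updated together, as the original comment requires.
        self._output_token_ids.append(token_id)
        self._all_token_ids.append(token_id)
```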
Nice catch!
I actually thought properties were cached after the first call, nice call
> I actually thought properties were cached after the first call, nice call

That would involve the use of cached_property.
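For reference, a plain @property runs its getter on every access, while functools.cached_property stores the result on the instance after the first call. A toy illustration (not vLLM code):

```python
from functools import cached_property


class Example:
    def __init__(self) -> None:
        self.plain_calls = 0
        self.cached_calls = 0

    @property
    def plain(self) -> int:
        # Recomputed on every attribute access.
        self.plain_calls += 1
        return self.plain_calls

    @cached_property
    def cached(self) -> int:
        # Computed once; the result is stored in the instance __dict__.
        self.cached_calls += 1
        return self.cached_calls


e = Example()
assert (e.plain, e.plain) == (1, 2)    # getter ran twice
assert (e.cached, e.cached) == (1, 1)  # getter ran once
```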
Signed-off-by: Nick Hill <[email protected]>
Wow, the impact on P99 ITL is crazy.
# Mark the startup heap as static so that it's ignored by GC.
# Reduces pause times of oldest generation collections.
gc.collect()
gc.freeze()
Do we need to call unfreeze at some point?
No, this is mostly static stuff that will be around for the lifetime of the process anyhow.
https://www.rippling.com/blog/the-garbage-collector-fights-back
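For context, the pattern in the diff above is: let startup allocation settle, run one full collection, then move everything that survived into the permanent generation so later collections skip it. A standalone illustration (the comments reflect the discussion above):

```python
import gc

# ... startup work happens here: imports, engine construction, warmup, etc.

# Mark the startup heap as static so that it's ignored by GC.
gc.collect()   # clean up whatever garbage startup produced
gc.freeze()    # move surviving objects to the permanent generation

print(f"Objects excluded from future collections: {gc.get_freeze_count()}")

# gc.unfreeze() exists, but as noted above it isn't needed here: the frozen
# objects are long-lived and are reclaimed when the process exits.
```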
Combining with #12298 and increasing the max output processing chunk size to 256 gets higher throughput at the cost of slightly more latency variance. Since the benchmark I've been running uses 400 concurrent requests, the 256 chunk size essentially just means those ~400 outputs will be split into two chunks. If I disable the chunking completely, the throughput increases to 80 req/sec (with the coalescing), but the inter-response latencies become larger and more uneven.
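A rough sketch of the chunking idea being discussed (MAX_CHUNK_SIZE, process_outputs, and handle_output are illustrative names, not the actual vLLM API): process at most a fixed number of request outputs per iteration and yield to the event loop between chunks, trading a little throughput for smoother inter-response latencies.

```python
import asyncio
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

MAX_CHUNK_SIZE = 256  # the chunk size value discussed above


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Split items into consecutive chunks of at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def handle_output(output: dict) -> None:
    """Placeholder for per-request detokenization / response dispatch."""
    ...


async def process_outputs(outputs: List[dict]) -> None:
    """Post-process a batch of outputs without hogging the event loop."""
    for chunk in chunked(outputs, MAX_CHUNK_SIZE):
        for output in chunk:
            handle_output(output)
        # Yield to the event loop so pending responses/streams can be
        # flushed between chunks instead of waiting for the whole batch.
        await asyncio.sleep(0)
```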
It would probably be good to also make
Signed-off-by: Nick Hill <[email protected]>
LGTM!
I ran an lm-eval test with gsm8k as a smoke test and got the same result as v0
VLLM_USE_V1=1 vllm serve meta-llama/Llama-3.1-8B-Instruct --disable-log-requests --port 8000 --max-num-batched-tokens 8192 --no-enable-prefix-caching
lm_eval --model local-completions --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,tokenized_requests=False --tasks gsm8k --num_fewshot 5
local-completions (model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=50,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 1
|Tasks|Version| Filter |n-shot| Metric | |Value | |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.7718|± |0.0116|
| | |strict-match | 5|exact_match|↑ |0.6983|± |0.0126|
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#   vllm/envs.py
These help in particular with TTFT, ITL variance, and overall throughput.
Benchmark on A100: before vs. after.