Add initial support for Prometheus metric collection #95
Conversation
Thank you as always, @masahi!
Excited to see this coming; it will help us greatly to better understand the engine behavior. A few minor comments.
buckets_batched_decode_tokens = (1, 10, 30, 50, 75, 100, 125, 150, 175, 200, 250, 300)

for label, buckets in [
    (E2E_LATENCY, buckets_e2e_lat),
How do we handle exceptions in these metrics? Are they included in the e2e latency buckets, for example?
We record a metric only when a request successfully reaches the point in the code where the metric is instrumented. For e2e latency, that's L165 in stage_engne_worker.py. So if an exception is raised, the request doesn't reach that point and hence it won't be recorded.
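As a minimal sketch of that behavior (the metric name, buckets, and wrapper here are hypothetical, not the PR's actual worker code), the histogram is only observed after the work finishes, so a raised exception skips the observation entirely:

import time

from prometheus_client import Histogram

# Hypothetical metric; name and buckets are illustrative only.
E2E_LATENCY_HIST = Histogram(
    "e2e_request_latency_seconds",
    "End-to-end request latency in seconds",
    buckets=(0.5, 1.0, 2.5, 5.0, 10.0, 20.0, 40.0, 60.0),
)

def handle_request(run_step):
    start = time.time()
    output = run_step()                            # if this raises, observe() is never reached
    E2E_LATENCY_HIST.observe(time.time() - start)  # recorded only on the success path
    return output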
Should we count the number of each type of exception raised? That would be helpful for identifying abnormal behavior.
That's possible if we have a particular exception in mind. We need to be aware of where that exception could be raised and add a counter in the except: block. This could be one of the follow-up items we can work on.
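As a rough sketch of that follow-up (the metric name, label, and wrapper are assumptions for illustration, not part of this PR), a labeled Counter could be incremented in the except: block before re-raising:

from prometheus_client import Counter

# Hypothetical counter; one time series per exception class.
ENGINE_EXCEPTIONS = Counter(
    "engine_exceptions_total",
    "Number of exceptions raised while serving requests",
    ["exception_type"],
)

def run_with_exception_count(step):
    try:
        return step()
    except Exception as e:
        # Count by exception class so abnormal behavior shows up per type.
        ENGINE_EXCEPTIONS.labels(exception_type=type(e).__name__).inc()
        raise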
buckets_ttft = (0.1, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0)
buckets_batched_prefill_tokens = (500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000)
buckets_batched_decode_tokens = (1, 10, 30, 50, 75, 100, 125, 150, 175, 200, 250, 300)
For the future, I think tracking memory & compute utilization would be helpful. With my naive understanding, this might be tricky since it may require per-GPU tracking.
If there is a tool to get such information, we can certainly do that. But for memory utilization (as opposed to KV cache utilization), due to memory profiling and KV cache pre-allocation, we always use 90% of the available VRAM.
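If per-GPU tracking is wanted later, one possible approach (an assumption, not something this PR implements) is to poll NVML and export per-GPU Gauges:

import pynvml
from prometheus_client import Gauge

# Hypothetical gauges; names and the per-GPU label scheme are illustrative.
GPU_MEM_USED_RATIO = Gauge("gpu_memory_used_ratio", "Used / total GPU memory", ["gpu"])
GPU_COMPUTE_UTIL = Gauge("gpu_compute_utilization", "GPU compute utilization (0 to 1)", ["gpu"])

def sample_gpu_utilization():
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            GPU_MEM_USED_RATIO.labels(gpu=str(i)).set(mem.used / mem.total)
            GPU_COMPUTE_UTIL.labels(gpu=str(i)).set(util.gpu / 100.0)
    finally:
        pynvml.nvmlShutdown()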
commit de28200 Author: Masahiro Masuda <[email protected]> Date: Fri Nov 17 01:20:36 2023 +0000 wip
Thank you for the awesome addition, @masahi!
This PR instruments the engine to collect Prometheus metrics. The collected metrics can be queried from the new metrics endpoint (for example, curl 'http://127.0.0.1:8000/metrics'). The following metrics have been enabled in this PR, and others can be easily added as needed.
By combining this endpoint with the Prometheus server and Grafana, we can get a nice visualization of metrics over time. See the examples below.
KV cache utilization (from 0 to 1)
E2E latency histogram over time, visualized as a heatmap
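As a minimal sketch of how such a metrics endpoint can be wired up (assuming a FastAPI-style HTTP server, which may not match the engine's actual server code):

from fastapi import FastAPI, Response
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

app = FastAPI()

@app.get("/metrics")
def metrics() -> Response:
    # Render every metric in the default registry in the Prometheus text format,
    # so curl 'http://127.0.0.1:8000/metrics' (or a Prometheus scrape) can read it.
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)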