
Add initial support for Prometheus metric collection #95

Merged 9 commits into octoml:batch-serving on Dec 6, 2023

Conversation

@masahi (Member) commented Dec 1, 2023

This PR instruments the engine to collect Prometheus metrics. The collected metrics can be queried via the new /metrics endpoint (for example, curl 'http://127.0.0.1:8000/metrics').

The following metrics have been enabled in this PR, and others can be easily added as needed (a sketch of how such metrics can be defined and exposed follows the list).

  • Prefill / decode batched token counts histogram
  • E2E latency histogram
  • TTFT histogram
  • Counter for the number of cache evictions
  • KV cache utilization gauge
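
As a rough illustration (a minimal sketch, not the PR's actual code), the listed metrics could be defined with the prometheus_client library and served from a FastAPI /metrics route; the metric names and buckets below are made up for illustration.

from fastapi import FastAPI, Response
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

# Illustrative metric objects; the PR's actual identifiers and buckets may differ.
E2E_LATENCY_HIST = Histogram("e2e_latency_seconds", "End-to-end request latency", buckets=(0.5, 1.0, 2.0, 4.0, 8.0, 16.0))
TTFT_HIST = Histogram("time_to_first_token_seconds", "Time to first token", buckets=(0.1, 0.3, 0.6, 0.9, 1.2, 1.5))
PREFILL_TOKENS_HIST = Histogram("batched_prefill_tokens", "Batched prefill token count")
DECODE_TOKENS_HIST = Histogram("batched_decode_tokens", "Batched decode token count")
CACHE_EVICTION_COUNTER = Counter("kv_cache_evictions_total", "Number of KV cache evictions")
KV_CACHE_UTIL_GAUGE = Gauge("kv_cache_utilization", "KV cache utilization, from 0 to 1")

app = FastAPI()

@app.get("/metrics")
def metrics() -> Response:
    # Serialize every registered metric in the Prometheus text exposition format.
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)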

By combining this endpoint with the Prometheus server and Grafana, we can get a nice visualization of the metrics over time. See the examples below.

KV cache utilization (from 0 to 1)
[Screenshot 2023-12-01 17:37:25]

E2E latency histogram over time visualized as heatmap
[Screenshot 2023-12-01 17:36:01]

@masahi marked this pull request as ready for review on December 1, 2023 at 10:40
@sunggg (Member) left a comment

Thank you as always, @masahi!
Excited to see this coming; it will help us greatly to better understand the engine's behavior. A few minor comments.

buckets_batched_decode_tokens = (1, 10, 30, 50, 75, 100, 125, 150, 175, 200, 250, 300)

for label, buckets in [
(E2E_LATENCY, buckets_e2e_lat),
Member

How do we handle exceptions in these metrics? Are they included in the e2e latency buckets, for example?

Member Author

We record a metric only when a request successfully reaches the point in the code where the metric is instrumented. For e2e latency, that's L165 in stage_engine_worker.py. So if an exception is raised, the request doesn't reach that point and hence won't be recorded.
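
In other words, the observation happens only on the success path. A minimal sketch of that pattern (illustrative names; run_engine stands in for the actual engine call, and E2E_LATENCY_HIST refers to the histogram from the sketch in the description):

import time

def handle_request(request):
    start = time.time()
    output = run_engine(request)  # hypothetical engine call; may raise
    # Reached only on success, so requests that raise are never observed.
    E2E_LATENCY_HIST.observe(time.time() - start)
    return output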

@sunggg (Member) commented Dec 5, 2023

Should we count the number of exceptions raised, per exception type? That would be helpful for identifying abnormal behavior.

@masahi (Member, Author) commented Dec 5, 2023

That's possible if we have a particular exception in mind. We need to know where that exception could be raised and add a counter in the except: block. This could be one of the follow-up items we do.
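
A minimal sketch of what that follow-up could look like (hypothetical metric and label names), using a labeled prometheus_client Counter incremented in the except: block:

from prometheus_client import Counter

ENGINE_EXCEPTIONS = Counter("engine_exceptions_total", "Engine exceptions by type", ["exception_type"])

try:
    output = run_engine(request)  # hypothetical engine call
except Exception as e:
    # Count the exception by its class name, then re-raise so existing handling is unchanged.
    ENGINE_EXCEPTIONS.labels(exception_type=type(e).__name__).inc()
    raise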

buckets_ttft = (0.1, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0)
buckets_batched_prefill_tokens = (500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000)
buckets_batched_decode_tokens = (1, 10, 30, 50, 75, 100, 125, 150, 175, 200, 250, 300)

Member

For the future, I think tracking memory and compute utilization would be helpful. With my naive understanding, this might be tricky since it may require per-GPU tracking.

Member Author

If there is a tool to get such information, we can certainly do that. But for memory utilization (as opposed to KV cache utilization), due to memory profiling and KV cache pre-allocation, we always use 90% of the available VRAM.
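
If per-GPU tracking is picked up as a follow-up, one possible approach is to poll NVML periodically and export the readings as labeled gauges. A minimal sketch (assumes the pynvml / nvidia-ml-py package; gauge names are illustrative, and as noted above the memory figure would mostly reflect the pre-allocated KV cache):

import pynvml
from prometheus_client import Gauge

GPU_MEMORY_USED = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])
GPU_COMPUTE_UTIL = Gauge("gpu_compute_utilization_percent", "GPU compute utilization", ["gpu"])

def sample_gpu_metrics():
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            GPU_MEMORY_USED.labels(gpu=str(i)).set(mem.used)
            GPU_COMPUTE_UTIL.labels(gpu=str(i)).set(util.gpu)
    finally:
        pynvml.nvmlShutdown()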

@sunggg (Member) left a comment

Thank you for the awesome addition, @masahi!

@sunggg merged commit 21ee0bb into octoml:batch-serving on Dec 6, 2023
1 check passed