Add initial support for Prometheus metric collection #95
@@ -0,0 +1,43 @@

```python
from .metrics_labels import *
from prometheus_client import Counter, Histogram, Gauge


class PrometheusMetrics:
    def __init__(self):
        self.counters = {}
        self.histograms = {}
        self.gauges = {}

        # Counters
        for label in [NUM_CACHE_EVICTONS]:
            self.counters[label] = Counter(label, label)

        # Bucket boundaries for the latency and batch-size histograms
        buckets_e2e_lat = (0.5, 2.5, 4.5, 6.5, 8.5, 10.5, 12.5, 14.5, 16.5, 18.5)
        buckets_ttft = (0.1, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4, 2.7, 3.0)
        buckets_batched_prefill_tokens = (500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500, 5000, 6000, 7000, 8000)
        buckets_batched_decode_tokens = (1, 10, 30, 50, 75, 100, 125, 150, 175, 200, 250, 300)

        # Histograms
        for label, buckets in [
            (E2E_LATENCY, buckets_e2e_lat),
            (FIRST_TOKEN_LATENCY, buckets_ttft),
            (BATCHED_PREFILL_TOKENS, buckets_batched_prefill_tokens),
            (BATCHED_DECODE_TOKENS, buckets_batched_decode_tokens),
        ]:
            self.histograms[label] = Histogram(label, label, buckets=buckets)

        # Gauges
        for label in [KV_CACHE_UTILIZATION]:
            self.gauges[label] = Gauge(label, label)

    def _lookup(self, metrics_dict, label):
        if label in metrics_dict:
            return metrics_dict[label]

        raise RuntimeError(f"No metric {label} found.")

    def counter(self, label: str):
        return self._lookup(self.counters, label)

    def histogram(self, label: str):
        return self._lookup(self.histograms, label)

    def gauge(self, label: str):
        return self._lookup(self.gauges, label)
```

Review comment: How do we handle exceptions in these metrics? Are they included in the e2e latency buckets, for example?

Reply: We record a metric only when a request successfully reaches the point in the code where the metric is instrumented. For e2e latency, that's L165 in

Review comment: Should we count the number of each type of exception raised? That would be helpful for identifying abnormal behavior.

Reply: That's possible if we have a particular exception in mind. We need to be aware of where that exception could be raised, and add a counter in
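To make the exception-counting idea concrete, here is a hedged sketch of what such a counter could look like. The NUM_OOM_ERRORS label, the run_scheduler_step wrapper, and the choice of torch.cuda.OutOfMemoryError are illustrative assumptions, not part of this PR; in practice the label would live in metrics_labels and be registered in PrometheusMetrics.__init__ alongside the other counters.

```python
import torch
from prometheus_client import Counter

# Hypothetical label and counter for one specific exception type.
NUM_OOM_ERRORS = "num_oom_errors"

metrics = PrometheusMetrics()  # assumes the class from this PR is importable
metrics.counters[NUM_OOM_ERRORS] = Counter(NUM_OOM_ERRORS, NUM_OOM_ERRORS)

def run_scheduler_step(step_fn):
    """Wrap the call site where we know the exception can be raised."""
    try:
        return step_fn()
    except torch.cuda.OutOfMemoryError:
        # Count the failure, then re-raise so existing error handling still applies.
        metrics.counter(NUM_OOM_ERRORS).inc()
        raise
```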
@@ -0,0 +1,6 @@

```python
NUM_CACHE_EVICTONS = "num_cache_evictions"
E2E_LATENCY = "e2e_latency"
FIRST_TOKEN_LATENCY = "first_token_latency"
KV_CACHE_UTILIZATION = "kv_cache_utilization"
BATCHED_PREFILL_TOKENS = "batched_prefill_tokens"
BATCHED_DECODE_TOKENS = "batched_decode_tokens"
```
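For context, this is roughly how the two new files fit together at runtime. Everything below except the PrometheusMetrics class, the label constants, and the prometheus_client calls is an illustrative assumption: the handler, the port, and the example values are not part of the PR.

```python
import time
from prometheus_client import start_http_server

# Assumes PrometheusMetrics and the label constants are imported from the modules added here.
metrics = PrometheusMetrics()
start_http_server(8000)  # expose /metrics for Prometheus to scrape; the port is an example

def generate(run_inference):
    start = time.time()
    output = run_inference()
    # The observation falls into the e2e latency buckets configured in __init__.
    metrics.histogram(E2E_LATENCY).observe(time.time() - start)
    return output

# Illustrative call sites elsewhere in the engine:
metrics.counter(NUM_CACHE_EVICTONS).inc()       # one more cache eviction
metrics.gauge(KV_CACHE_UTILIZATION).set(0.42)   # fraction of KV cache in use (example value)
```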
Review comment: For the future, I think tracking memory and compute utilization would be helpful. With my naive understanding, this might be tricky since it may require per-GPU tracking.

Reply: If there is a tool to get such information, we can certainly do that. But for memory utilization (as opposed to KV cache utilization), we always use 90% of the available VRAM because of memory profiling and KV cache pre-allocation.
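On the per-GPU point, here is a hedged sketch of one way memory and compute utilization could be exported, using pynvml and labeled gauges. The metric names, the "gpu" label, and the polling function are assumptions for illustration, not something this PR provides.

```python
import pynvml
from prometheus_client import Gauge

# Hypothetical per-GPU gauges; the "gpu" label lets one metric cover all devices.
GPU_MEMORY_USED_FRACTION = Gauge("gpu_memory_used_fraction", "gpu_memory_used_fraction", ["gpu"])
GPU_COMPUTE_UTILIZATION = Gauge("gpu_compute_utilization", "gpu_compute_utilization", ["gpu"])

def sample_gpu_utilization():
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            GPU_MEMORY_USED_FRACTION.labels(gpu=str(i)).set(mem.used / mem.total)
            GPU_COMPUTE_UTILIZATION.labels(gpu=str(i)).set(util.gpu / 100.0)
    finally:
        pynvml.nvmlShutdown()
```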