Monitoring & Telemetry

  • OPI adopted OpenTelemetry (https://opentelemetry.io/) for DPUs

  • The OPI goal is to pick one standard protocol that

    • all vendors can implement (both Linux and non-Linux based)
    • DPU consumers can integrate once into their existing monitoring systems and tools
  • OpenTelemetry supports these data sources (a minimal SDK sketch follows this list):

    • Traces
    • Metrics
    • Logs
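
As a concrete illustration of the metrics data source, here is a minimal sketch using the OpenTelemetry Go SDK. The instrumentation scope opi.dpu.example and the counter name dpu.packets.processed are hypothetical, and without a registered MeterProvider the calls below are harmless no-ops.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
)

func main() {
	ctx := context.Background()

	// Obtain a Meter from the globally registered MeterProvider.
	// If no provider has been registered, this is a no-op meter.
	meter := otel.Meter("opi.dpu.example") // hypothetical instrumentation scope

	// A counter for a DPU-style metric; the name is illustrative only.
	packets, err := meter.Int64Counter("dpu.packets.processed")
	if err != nil {
		log.Fatal(err)
	}

	// Record one processed packet.
	packets.Add(ctx, 1)
}
```
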

What is mandated by OPI

Collector deploy options

OPI Telemetry Architecture

  • The OpenTelemetry Collector supports several deployment options:

    • Deploy as a sidecar inside every pod
    • Deploy another as an aggregator per node
    • Deploy another as a super-aggregator per cluster
  • The benefits of having multiple collectors at different levels are:

    • Increased redundancy
    • Enrichment
    • Filtering
    • Separating trust domains
    • Batching
    • Sanitization
  • Recommendation (reference); a wiring sketch follows this list

    • micro-aggregator inside each DPU/IPU
    • macro-aggregator between DPUs
      • the macro-aggregator can run on the host with the DPU/IPU or on a separate host
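
To make the recommendation concrete, here is a minimal sketch of a DPU-side service exporting metrics over OTLP/gRPC to a micro-aggregator collector, using the OpenTelemetry Go SDK as an assumed implementation. The endpoint localhost:4317 (the default OTLP gRPC port) stands in for the micro-aggregator inside the DPU/IPU and is an assumption.

```go
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// OTLP/gRPC exporter pointed at the micro-aggregator collector
	// running inside the DPU/IPU (hypothetical address, default OTLP port).
	exp, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Periodically push collected metrics to the micro-aggregator,
	// which can then forward them to the macro-aggregator.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exp,
			sdkmetric.WithInterval(10*time.Second))),
	)
	defer provider.Shutdown(ctx)
	otel.SetMeterProvider(provider)
}
```

Chaining collectors this way is what enables the enrichment, filtering, batching, and trust-domain separation benefits listed above at each aggregation level.
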

System Monitoring

Tracing

  • Tracing inside the DPU/IPU (tighter SDK integration into our services and IPDK), streaming to Zipkin/Jaeger; a tracing sketch follows
  • TODO: need more details and examples
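
As a starting point for those examples, here is a minimal tracing sketch with the OpenTelemetry Go SDK, exporting spans over OTLP/gRPC; recent Jaeger releases ingest OTLP natively, and Zipkin can be fed via its own exporter package. The endpoint and the span name provision-virtio-blk are hypothetical.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// OTLP/gRPC span exporter; Jaeger accepts OTLP directly on port 4317
	// (hypothetical endpoint).
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	defer tp.Shutdown(ctx)
	otel.SetTracerProvider(tp)

	// Wrap a hypothetical DPU operation in a span.
	tracer := otel.Tracer("opi.dpu.example")
	ctx, span := tracer.Start(ctx, "provision-virtio-blk")
	defer span.End()
	_ = ctx // pass ctx to downstream calls so child spans nest under this one
}
```
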

Logging

Examples

Open questions (eventually remove this section)

  • Is there integration of OTEL with KVM or ESXi?
  • Use case of a standalone DPU, not attached to a server, that still runs an OTEL collector

Working items

  • #92 Starting a new workstream to find the set of common metrics across vendors that OPI will mandate
    • Action items on Marvell, Nvidia, and Intel to come up with the list and present at the next meeting
  • #93 Starting a new POC with the OTEL SDK and a hello world app
  • #94 Continue working on the existing telegraf example and enhance it with more metrics