r/MachineLearning · · 1 min read

l9gpu - open-source GPU observability with workload-level attribution [P]

GPU monitoring tools like DCGM give you hardware-level metrics but no workload context. When a node is saturated, you can't tell which experiment, team, or job is responsible without digging through logs.

We built l9gpu to close that gap. It's a node-level agent that exports GPU metrics via OTLP with workload attribution embedded:

- Kubernetes: correlates GPU metrics with pod, namespace, and deployment

- Slurm: correlates with job ID, user, and partition

- LLM inference: native metrics for vLLM, SGLang, and TGI

- Hardware: NVIDIA, AMD MI300X, Intel Gaudi

- 17 pre-built Prometheus alert rules + Grafana dashboards
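
To make the idea of workload attribution concrete, here is a minimal sketch of the general join, not l9gpu's actual implementation. It assumes the agent has per-GPU samples keyed by device UUID and a separate mapping from UUID to the workload that owns the device (pod and namespace on Kubernetes, job ID and user on Slurm). All names here (`GpuSample`, `attribute`, `utilization_by`, the label keys, and the sample values) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    """One hardware-level reading for a single GPU (hypothetical shape)."""
    gpu_uuid: str
    utilization_pct: float
    memory_used_mib: int


def attribute(samples, owners):
    """Attach workload labels to each sample.

    `owners` maps GPU UUID -> dict of workload labels. GPUs with no known
    owner are labelled "unattributed" rather than dropped.
    """
    records = []
    for s in samples:
        labels = owners.get(s.gpu_uuid, {"workload": "unattributed"})
        records.append({
            "gpu_uuid": s.gpu_uuid,
            "utilization_pct": s.utilization_pct,
            "memory_used_mib": s.memory_used_mib,
            **labels,
        })
    return records


def utilization_by(records, key):
    """Sum utilization per value of `key` (e.g. per namespace or per user)."""
    totals = {}
    for r in records:
        group = r.get(key, "unattributed")
        totals[group] = totals.get(group, 0.0) + r["utilization_pct"]
    return totals


# Illustrative data only: two GPUs owned by a pod, one by a Slurm job, one idle.
samples = [
    GpuSample("GPU-a", 90.0, 70000),
    GpuSample("GPU-b", 80.0, 65000),
    GpuSample("GPU-c", 50.0, 20000),
    GpuSample("GPU-d", 0.0, 0),
]
owners = {
    "GPU-a": {"workload": "k8s", "namespace": "research", "pod": "train-0"},
    "GPU-b": {"workload": "k8s", "namespace": "research", "pod": "train-0"},
    "GPU-c": {"workload": "slurm", "job_id": "1234", "user": "alice"},
}

records = attribute(samples, owners)
per_namespace = utilization_by(records, "namespace")
```

With labels attached at export time, the "which team is saturating this node" question becomes a group-by on the metric labels instead of a log search.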

l9gpu is derived from Meta's gcm project and extends it with Kubernetes attribution, multi-vendor GPU support, and OTLP export. It is MIT licensed.

https://github.com/last9/gpu-telemetry

Happy to discuss design decisions around the attribution mapping. What is the ML infra community using for GPU cost visibility in shared research clusters?

submitted by /u/bakibab