l9gpu - open-source GPU observability with workload-level attribution [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
GPU monitoring tools like DCGM give you hardware-level metrics but no workload context. When a node is saturated, you can't tell which experiment, team, or job is responsible without digging through logs.
We built l9gpu to close that gap. It's a node-level agent that exports GPU metrics via OTLP with workload attribution embedded:
- Kubernetes: correlates GPU metrics with pod, namespace, and deployment
- Slurm: correlates with job ID, user, and partition
- LLM inference: native metrics for vLLM, SGLang, and TGI
- Hardware: NVIDIA, AMD MI300X, Intel Gaudi
- 17 pre-built Prometheus alert rules + Grafana dashboards
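
To make the attribution idea concrete, here is a minimal sketch of joining per-GPU samples with workload ownership before export. It is not l9gpu's actual code or schema: `GpuSample`, `Workload`, `attribute`, and the `workload.scheduler` key are hypothetical names for illustration, and the ownership map is assumed to come from whatever the scheduler exposes (pod device allocations on Kubernetes, job-to-GPU bindings on Slurm).

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One hardware-level reading for a single GPU (hypothetical shape)."""
    gpu_uuid: str
    utilization_pct: float
    memory_used_mib: int


@dataclass
class Workload:
    """Who owns a GPU right now, as reported by the scheduler (hypothetical shape)."""
    scheduler: str   # e.g. "kubernetes" or "slurm"
    labels: dict     # e.g. pod/namespace, or job ID/user/partition


def attribute(samples, owners):
    """Attach workload labels to each GPU sample.

    `owners` maps GPU UUID -> Workload. GPUs with no known owner are
    still emitted, tagged with scheduler "none", so idle or unattributed
    capacity stays visible.
    """
    out = []
    for s in samples:
        attrs = {"gpu.uuid": s.gpu_uuid}
        w = owners.get(s.gpu_uuid)
        if w is None:
            attrs["workload.scheduler"] = "none"
        else:
            attrs["workload.scheduler"] = w.scheduler
            attrs.update(w.labels)
        out.append({
            "attributes": attrs,
            "utilization_pct": s.utilization_pct,
            "memory_used_mib": s.memory_used_mib,
        })
    return out
```

With labels attached at the source, a saturated node can be broken down by namespace or Slurm user in the metrics backend instead of being cross-referenced against logs after the fact.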
Derived from Meta's gcm project, extended with K8s attribution, multi-vendor GPU support, and OTLP export. MIT licensed.
https://github.com/last9/gpu-telemetry
Happy to discuss design decisions around the attribution mapping. What is the ML infra community using for GPU cost visibility in shared research clusters?