Profiling PyTorch training without accidentally stalling the GPU [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Profiling PyTorch training has an interesting measurement problem: the more you measure, the more you can change the behavior of the run itself.
A simple example is torch.cuda.synchronize(). It gives cleaner timing boundaries, but it does so by blocking the host until all queued GPU work has finished, which inserts synchronization points into an otherwise asynchronous CUDA workload.
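For reference, here is a minimal sketch of the synchronize-based pattern described above. The helper name `timed_step_with_sync` and the `step_fn` callable are hypothetical, chosen only for illustration. The point is that both `torch.cuda.synchronize()` calls sit directly in the hot path, so the host cannot queue further work while it waits.

```python
import time
import torch


def timed_step_with_sync(step_fn):
    """Hypothetical helper: time one call to step_fn (in ms) by blocking the host."""
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()  # drain work queued before the measured region
    t0 = time.perf_counter()
    out = step_fn()
    if use_cuda:
        torch.cuda.synchronize()  # host stalls here until the GPU has finished
    elapsed_ms = (time.perf_counter() - t0) * 1000.0
    return out, elapsed_ms


# Usage: falls back to plain wall-clock timing when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.ones(256, 256, device=device)
result, ms = timed_step_with_sync(lambda: (x @ x).sum())
```

The numbers this produces are easy to interpret, but every measured step now pays for a full pipeline drain, which is exactly the behavior change the post is about.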
An alternative is to use CUDA events around selected boundaries and read them later, so timing can be captured without forcing synchronization in the hot path. This does not replace PyTorch Profiler or Nsight, but it can work as a lightweight first pass before deeper operator-level profiling.
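A rough sketch of the deferred-read idea, under some assumptions: the class `DeferredCudaTimer` and its method names are hypothetical and are not taken from the tool mentioned in the post. It relies only on `torch.cuda.Event(enable_timing=True)`, `Event.record()`, `Event.query()`, and `Event.elapsed_time()`. Events are recorded on the current stream inside the loop, and timings are read only for events the GPU has already passed.

```python
import torch


class DeferredCudaTimer:
    """Hypothetical helper: record CUDA event pairs now, read timings later."""

    def __init__(self):
        self.pending = []      # (name, start_event, end_event) not yet resolved
        self.timings_ms = {}   # name -> list of elapsed times in milliseconds

    def start(self, name):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()  # enqueued on the current stream; the host does not wait
        return (name, start, end)

    def stop(self, handle):
        handle[2].record()  # end marker, also enqueued without waiting
        self.pending.append(handle)

    def collect(self):
        """Resolve only events that have already completed; never blocks."""
        unresolved = []
        for name, start, end in self.pending:
            if end.query():  # True once the GPU has reached the end event
                elapsed = start.elapsed_time(end)
                self.timings_ms.setdefault(name, []).append(elapsed)
            else:
                unresolved.append((name, start, end))
        self.pending = unresolved
        return self.timings_ms


if torch.cuda.is_available():
    timer = DeferredCudaTimer()
    x = torch.randn(1024, 1024, device="cuda")
    for _ in range(10):
        handle = timer.start("matmul")
        y = x @ x
        timer.stop(handle)
        timer.collect()        # cheap poll, picks up whatever has finished
    torch.cuda.synchronize()   # single sync at the end, outside the hot path
    print(timer.collect())
```

Because both events are recorded on the same stream, a completed end event implies the start event has completed too. Work spread across multiple streams would need events recorded on each stream, which this sketch does not attempt.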
I wrote a short technical note about this while working on an open-source PyTorch training diagnostics tool.