ROCm with PyTorch and PyTorch Lightning seems to still suck for research [D]
So I asked about people's experiences with ROCm in a post a few weeks ago:
https://www.reddit.com/r/MachineLearning/comments/1t6cng3/rocm_status_in_mid_2026_d/
I actually went and procured an RX 7900 XTX reference card to give it a try.
My discovery is that it still kind of sucks.
I have a small codebase for training flow matching models (SANA architecture), which runs fine on my RTX 3090s. But the moment I ported it across to ROCm it was NaNs absolutely everywhere. Forward passes were fine, but the moment you called backward() all bets were off. The code was kept identical, apart from altering the pip environment to point at torch 2.12 built against ROCm 7.2 instead of CUDA.
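In case it helps anyone chasing something similar, the first thing I'd reach for is PyTorch's built-in anomaly detection plus a per-parameter gradient check after backward(). Generic sketch, nothing ROCm-specific, and check_grads is just a helper name I made up:

    import torch

    # Makes autograd re-raise with a trace of the forward op that produced
    # the NaN. Big slowdown, so only enable it while debugging.
    torch.autograd.set_detect_anomaly(True)

    def check_grads(model: torch.nn.Module) -> None:
        """Print every parameter whose gradient is NaN/Inf after backward()."""
        for name, p in model.named_parameters():
            if p.grad is not None and not torch.isfinite(p.grad).all():
                print(f"non-finite grad in {name}")

    # In the training loop, right after loss.backward():
    # check_grads(model)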
Everything I tried, from switching between bf16 and fp32 to tweaking various environment variables, yielded nothing.
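If the gradients come back finite but you still don't trust them, a cheap follow-up is to run one identical step on CPU and on the GPU and diff the gradients. Toy two-layer model below purely to illustrate the pattern; note that ROCm builds of torch still expose the device as "cuda":

    import copy
    import torch

    def one_step_grads(model, x, device):
        # Deep-copy so both devices start from identical weights.
        m = copy.deepcopy(model).to(device)
        m(x.to(device)).pow(2).mean().backward()
        return {n: p.grad.detach().cpu() for n, p in m.named_parameters()}

    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(64, 64), torch.nn.GELU(), torch.nn.Linear(64, 1)
    )
    x = torch.randn(8, 64)

    ref = one_step_grads(model, x, "cpu")
    gpu = one_step_grads(model, x, "cuda")  # "cuda" is the ROCm device on AMD builds
    for name in ref:
        diff = (ref[name] - gpu[name]).abs().max().item()
        ok = "finite" if torch.isfinite(gpu[name]).all() else "NaN/Inf"
        print(name, f"max diff {diff:.3e}", ok)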
Unless there's some trick I'm missing, I get the feeling that ROCm is still seriously behind.
I also tried the nanoGPT training script, which ran perfectly.
My intuition is that the ROCm people have probably tested their stack against established, well-known codebases, but it's still remarkably fragile on even slightly uncommon code.