An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook.
Just wrapped another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and where the real bottlenecks live. Added mermaid diagrams for the architecture pieces so the flow is easier to follow than a wall of text.
It's a personal learning project, still growing chapter by chapter. I'd value feedback or corrections from anyone who's run inference in production, where my mental model breaks down is exactly what I want to find. Issues and PRs welcome.
[link] [comments]
More from r/MachineLearning
-
Loss functions in Instance Representation Learning [R]
Jun 29
-
Price elasticity model [R]
Jun 29
-
Rejected MICCAI paper: workshop -> journal/conference or directly journal/conference [R]
Jun 29
-
I built a demo agricultural planning system with an AI advisor for small-scale farmers in Nicaragua using NASA data [p]
Jun 29
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.