I wrote a free 15-part series on LLM internals — real math, real tensor shapes, real hardware constraints. All grounded in Gemma 4 12B's actual config.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
If you run open-source models and want to understand what's actually happening under the hood: I spent the last few months writing a 15-part series that covers the full stack, from tokenization to production serving.
Most articles are grounded in Gemma 4 12B as the running example.
The full series: Generative AI in Depth
Here's what each article covers and why I think it's worth your time:
1. Tokenisation in Depth. BPE, SentencePiece, vocabulary design. Why "tokenizer mismatch" silently breaks fine-tunes. Why Gemma 4's 262,144-token vocabulary costs ~2 GB of VRAM before the model even loads.
2. Inside LLM Inference: Every Calculation from Text to Token. Traces every tensor shape through a full Gemma 4 12B forward pass.
3. Attention Mechanisms and KV Cache: From First Principles. MHA → MQA → GQA → MLA. How DeepSeek's Multi-head Latent Attention compresses K/V into a low-rank latent space, and what that means for vLLM's kernel choices.
4. The Memory Math: What Fits on a GPU? The arithmetic for model weights + KV cache + activations + overhead. How to calculate whether a model fits before you download it. Why the KV cache at 128K context can exceed the model weights themselves.
5. Training vs Inference: Why the Same Model Costs 10× More to Train. Gradients, optimizer states, activation checkpointing. Why Gemma 4 12B needs ~200 GB to train but ~24 GB to run. The specific memory multipliers for Adam vs SGD vs 8-bit Adam.
6. Fine-Tuning and Adaptation: LoRA, QLoRA, RLHF, and DPO in Depth. How LoRA works mathematically, and why rank-16 adapters on a 12B model add only ~1% to the parameter count. QLoRA's double-quantization trick. Why DPO trains on preference pairs directly without a reward model.
7. Knowledge Distillation: Making Smaller Models That Punch Above Their Weight. Offline vs online distillation. Why reasoning traces (chain-of-thought distillation) transfer so much better than logit matching alone. The execution-gating technique that filters wrong-answer traces before they enter training.
8. A Quantization Primer: Formats, Architecture Sensitivity, and a Gemma 4 Case Study. GPTQ, AWQ, GGUF, FP8, and KV cache quantization, with actual file sizes from Bartowski's Gemma 4 GGUF quants. The formula for calculating how much quality you lose per bit.
9. CUDA Kernels and FlashAttention: Why Memory Bandwidth Is the Bottleneck. The roofline model, arithmetic intensity, and why decode is memory-bound (arithmetic intensity ≈ 1 FLOP/byte at batch size 1). How FlashAttention's tiling eliminates the O(T²) attention matrix from HBM entirely. Flash-Decoding: why standard FA2 uses 1 SM for decode but Flash-Decoding can use 32. CUDA Graph capture and why it cuts CPU launch overhead.
10. Speculative Decoding: Generating Multiple Tokens Per Step. Draft-then-verify. Why it speeds up low-batch inference but reduces throughput at high batch sizes, and the math behind this counterintuitive result. EAGLE vs n-gram vs draft model vs DFlash and MLP tradeoffs.
11. Mixture of Experts: Routing, Sparse Activation, and Why MoE Dominates at Scale. How DeepSeek V3's 671B model activates only 37B parameters per token. Router collapse and load balancing. Why expert parallelism is necessary for MoE serving and how it differs from tensor parallelism.
12. Context Length Scaling: RoPE, YaRN, Ring Attention, and the Cost of Long Context. Why RoPE extrapolates beyond training length (sometimes). YaRN's interpolation strategy. The memory and compute cost of 1M context, and why most "1M context" models aren't actually usable at that length.
13. LLM Serving in Depth: Batching, Scheduling, and Parallelism. PagedAttention internals. Continuous vs static batching. Prefix caching, and why it makes agentic workloads (same system prompt, different user messages) dramatically cheaper. Chunked prefill and why it prevents head-of-line blocking.
14. LLM Evaluation in Depth: Benchmarks, Contamination, and What Actually Matters. Why MMLU scores are nearly meaningless in 2026. Training data contamination and how to detect it. The benchmarks that actually predict real-world performance, and why vibes-based evals are often more reliable than leaderboards for specific use cases.
15. Which LLM Serving Framework Should You Use? llama.cpp, Ollama, vLLM, SGLang, TensorRT-LLM, TGI, LMDeploy, and mlx-lm, compared on throughput, latency, ease of use, and hardware support. Decision trees for local single-user, production multi-user, edge/embedded, and Apple Silicon.
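For a feel of the kind of arithmetic behind items 1, 4, 5, and 9, here is a rough back-of-the-envelope sketch in Python. It is not taken from the series. The hidden size (3840), layer count (48), KV head count (8), and head dimension (256) are assumptions chosen for illustration, as are bf16 storage, exactly 12 billion parameters, a standard mixed-precision Adam layout, and full attention in every layer. The function names are hypothetical.

```python
# Back-of-the-envelope memory math. All config values below are ASSUMPTIONS
# for illustration, not confirmed Gemma 4 12B settings.

GB = 1e9  # decimal gigabytes


def embedding_bytes(vocab_size, hidden_size, bytes_per_value=2):
    """Size of the token embedding table (one vector per vocabulary entry)."""
    return vocab_size * hidden_size * bytes_per_value


def inference_weight_bytes(n_params, bytes_per_param=2):
    """Weights only, stored in bf16 by default."""
    return n_params * bytes_per_param


def adam_training_bytes(n_params):
    """Weights, gradients, and optimizer state; activations are NOT counted.

    Assumed layout per parameter: bf16 weight (2) + bf16 gradient (2)
    + fp32 master weight (4) + fp32 Adam first and second moments (4 + 4).
    """
    return n_params * (2 + 2 + 4 + 4 + 4)


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    """K and V for every layer and position, assuming full attention everywhere."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value


def decode_arithmetic_intensity(batch, bytes_per_weight=2):
    """FLOPs per weight byte read during decode, ignoring activation traffic.

    Each weight is read once and used for one multiply-add (2 FLOPs)
    per sequence in the batch.
    """
    return 2 * batch / bytes_per_weight


N_PARAMS = 12_000_000_000  # assumed round number

print(f"embedding table: {embedding_bytes(262_144, 3840) / GB:.2f} GB")          # 2.01 GB
print(f"inference weights: {inference_weight_bytes(N_PARAMS) / GB:.0f} GB")      # 24 GB
print(f"Adam training state: {adam_training_bytes(N_PARAMS) / GB:.0f} GB")       # 192 GB
print(f"KV cache at 128K: {kv_cache_bytes(48, 8, 256, 131_072) / GB:.1f} GB")    # 51.5 GB
print(f"decode intensity at B=1: {decode_arithmetic_intensity(1)} FLOP/byte")    # 1.0
```

Under these assumptions the results land close to the figures quoted in the list: about 2 GB for the vocabulary, 24 GB to run, and roughly 200 GB to train once activations and overhead are added on top of the 192 GB. The KV cache figure depends heavily on the attention layout, so a model that uses sliding-window layers would come in well below the 51.5 GB shown here.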
There's also a companion vLLM Deep Dive Series (3 parts) that goes deeper into vLLM's internals — PagedAttention, disaggregated serving, all five parallelism strategies, and 60+ supported architectures.
Everything is free, no email required, no paywall. Happy to answer questions in the comments.