High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I recently fine-tuned a Gemma 4 26B model, and I'm seeing surprisingly high end-to-end (E2E) latency even though the effective inference footprint is much smaller (it behaves like a ~4B model during serving).
Current setup:
- Model: Gemma 4 26B (fine-tuned)
- Engine: vLLM
- Quantization: FP8
- Hardware: H100
Observed latency:
- TTFT: ~100–300 ms
- E2E latency: ~3–5 seconds
The TTFT seems reasonable, but the overall generation latency feels disproportionately high for the effective serving size.
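One way to sanity-check these numbers is to split E2E latency into TTFT plus the time spent on the remaining output tokens, which gives an implied per-token decode time. The sketch below is only arithmetic. The output lengths in the loop are hypothetical, since the real response length is not stated here, and `implied_tpot_ms` is an illustrative helper name, not part of vLLM.

```python
def implied_tpot_ms(e2e_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average time per output token after the first one, in milliseconds.

    Uses the decomposition: E2E = TTFT + (output_tokens - 1) * TPOT.
    """
    if output_tokens < 2:
        raise ValueError("need at least two output tokens to compute TPOT")
    return (e2e_ms - ttft_ms) / (output_tokens - 1)


if __name__ == "__main__":
    # Midpoints of the ranges reported above: ~4 s E2E, ~200 ms TTFT.
    # The output lengths are ASSUMED for illustration only.
    for n in (64, 256, 512):
        tpot = implied_tpot_ms(e2e_ms=4000, ttft_ms=200, output_tokens=n)
        print(f"{n:4d} tokens: {tpot:6.2f} ms/token, {1000 / tpot:7.1f} tokens/s")
```

If the responses are only a few dozen tokens long, the implied per-token time is large and decode speed is the thing to investigate. If they run to several hundred tokens, 3–5 seconds may simply reflect the output length.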
I already experimented with vLLM’s n-gram speculative decoding, but honestly didn’t see meaningful gains.
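For context on why n-gram speculation may not help here: the general idea is to propose draft tokens by finding an earlier occurrence of the most recent tokens and copying what followed. The toy proposer below is a simplified illustration of that idea, not vLLM's actual implementation, and `ngram_propose` is a made-up name.

```python
def ngram_propose(tokens: list, n: int = 2, k: int = 3) -> list:
    """Propose up to k draft tokens by matching the last n tokens earlier in the sequence."""
    if len(tokens) <= n:
        return []
    tail = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the tail
    # that still has at least one token following it.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == tail:
            return tokens[start + n:start + n + k]
    return []  # no earlier match: nothing to speculate on


if __name__ == "__main__":
    repetitive = ["the", "cat", "sat", "on", "the", "cat"]
    novel = ["a", "b", "c", "d"]
    print(ngram_propose(repetitive))  # finds a match and proposes a continuation
    print(ngram_propose(novel))       # no match, so no draft tokens
```

When the output rarely repeats spans from the prompt or from earlier output (free-form generation, as opposed to summarization or code editing), a proposer like this finds few matches, which would be consistent with seeing no meaningful gain.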
Now I’m considering more serious speculative decoding approaches:
- EAGLE / Medusa-style methods
- Draft-model-based speculative decoding
- Possibly training a smaller Gemma draft model
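Before investing in any of these, it may help to estimate the achievable speedup. The sketch below uses the standard analysis from the speculative decoding literature, under the simplifying assumption that each draft token is accepted independently with probability `alpha`. The values of `alpha`, `k`, and `c` in the loop are hypothetical, not measurements.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    alpha: per-token acceptance probability (assumed i.i.d.).
    k:     number of draft tokens proposed per step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    c: cost of one draft forward pass relative to one target forward pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * c + 1)


if __name__ == "__main__":
    # ASSUMED values for illustration; measure alpha on your own traffic.
    for alpha in (0.5, 0.7, 0.9):
        s = expected_speedup(alpha, k=4, c=0.1)
        print(f"alpha={alpha:.1f}, k=4, c=0.1: ~{s:.2f}x")
```

The takeaway is that the acceptance rate dominates: a draft model or EAGLE-style head that is poorly matched to the fine-tuned target gives little benefit, which is one argument for training the drafter on the fine-tuned model's own outputs.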
Curious to hear from others who’ve worked with Gemma 4 or large distilled/fine-tuned models:
- Is this kind of latency expected?
- What actually moved the needle for you?
- Any bottlenecks I should investigate first before going deeper into speculative decoding?
Would love to hear experiences, benchmarks, or even horror stories :))