r/MachineLearning · May 21, 2026 · 1 min read

High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]


I recently fine-tuned a Gemma 4 26B model and am seeing surprisingly high end-to-end latency, even though its effective inference footprint is much smaller (it behaves roughly like a ~4B model during serving).

Current setup:

Model: Gemma 4 26B (fine-tuned)
Engine: vLLM
Quantization: FP8
Hardware: H100

Observed latency:

TTFT: ~100–300 ms
E2E latency: ~3–5 seconds

The TTFT seems reasonable, but the time spent generating the rest of the response (E2E minus TTFT, i.e. several seconds of decoding) feels disproportionately high for the effective serving size.
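To make the TTFT vs. E2E gap concrete, here is a minimal sketch that splits a request into prefill and decode and derives the time per output token. The output length is not stated above, so the 400 tokens used in the example is a purely hypothetical value, and `decode_breakdown` is an illustrative helper, not a vLLM API.

```python
def decode_breakdown(ttft_s: float, e2e_s: float, output_tokens: int) -> tuple[float, float]:
    """Return (decode_seconds, seconds_per_output_token).

    The first token arrives at TTFT, so the remaining (output_tokens - 1)
    tokens are produced during the decode phase.
    """
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to measure decode speed")
    if e2e_s < ttft_s:
        raise ValueError("E2E latency cannot be smaller than TTFT")
    decode_s = e2e_s - ttft_s
    tpot_s = decode_s / (output_tokens - 1)
    return decode_s, tpot_s


# Hypothetical request: numbers inside the reported ranges, output length assumed.
decode_s, tpot_s = decode_breakdown(ttft_s=0.2, e2e_s=4.0, output_tokens=400)
print(f"decode: {decode_s:.2f} s, per token: {tpot_s * 1000:.1f} ms, "
      f"~{1 / tpot_s:.0f} tokens/s")
```

If the per-token time comes out reasonable for the hardware, the E2E number is mostly a function of how many tokens are generated rather than a serving problem, which is worth ruling out before tuning anything else.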

I already experimented with vLLM's n-gram speculative decoding but saw no meaningful gains.

Now I’m considering more serious speculative decoding approaches:

EAGLE / Medusa-style methods
Draft-model-based speculative decoding
Possibly training a smaller Gemma draft model
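Whether any of these options pays off depends largely on how often the target model accepts the drafted tokens. A rough way to reason about it is the standard approximation from the speculative decoding literature, which treats each drafted token as accepted independently with probability `alpha`. The sketch below is only that approximation: `alpha`, `k` (draft length), and `draft_cost_ratio` (cost of one draft step relative to one target step) are assumed inputs that would have to be measured, not values from the setup above.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha; the target model always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Approximate speedup over plain decoding.

    Each step costs one target forward pass plus k draft passes, each
    costing draft_cost_ratio of a target pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1.0)


# Hypothetical sweep: low acceptance can leave speculative decoding at or below 1x.
for alpha in (0.3, 0.6, 0.8):
    print(alpha, round(estimated_speedup(alpha, k=4, draft_cost_ratio=0.1), 2))
```

Under this model, a low acceptance rate yields a speedup close to (or below) 1x, which is one possible explanation for n-gram drafting showing no gain; measuring the actual acceptance rate would confirm or rule that out.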

Curious to hear from others who’ve worked with Gemma 4 or large distilled/fine-tuned models:

Is this kind of latency expected?
What actually moved the needle for you?
Any bottlenecks I should investigate first before going deeper into speculative decoding?

Would love to hear experiences, benchmarks, or even horror stories :))

submitted by /u/Ok-Rooster-8120