r/MachineLearning · 1 min read

High E2E latency on fine-tuned Gemma 4 26B despite low TTFT [R]

I recently fine-tuned a Gemma 4 26B model, and I’m seeing surprisingly high end-to-end (E2E) latency even though the effective inference footprint is much smaller (it behaves roughly like a ~4B model during serving).

Current setup:

  • Model: Gemma 4 26B (fine-tuned)
  • Engine: vLLM
  • Quantization: FP8
  • Hardware: H100
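
For a rough sanity check on this setup, here is a back-of-envelope sketch of the memory-bandwidth floor for decoding. It assumes batch size 1, FP8 weights at 1 byte per parameter, and an H100 SXM peak HBM bandwidth of about 3.35 TB/s (the PCIe variant is lower). It ignores KV-cache reads, activations, and kernel launch overhead, so real numbers will be worse; the function name is just illustrative.

```python
def min_decode_ms_per_token(params_billion: float,
                            bytes_per_param: float,
                            bandwidth_tb_s: float) -> float:
    """Lower bound on per-token decode time, assuming every weight that
    participates in a forward pass is read from HBM once per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds * 1e3


# Assumption: H100 SXM peak HBM bandwidth; adjust for your actual card.
HBM_TB_S = 3.35
# Assumption: FP8 weights take 1 byte per parameter.
FP8_BYTES = 1.0

for label, params in [("~4B effective", 4.0), ("26B total", 26.0)]:
    floor_ms = min_decode_ms_per_token(params, FP8_BYTES, HBM_TB_S)
    print(f"{label}: at least {floor_ms:.2f} ms per token")
```

If the measured per-token time is far above this floor, the gap is more likely scheduling, batching, or CPU-side overhead than raw weight reads.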

Observed latency:

  • TTFT: ~100–300 ms
  • E2E latency: ~3–5 seconds

The TTFT seems reasonable, but the overall generation latency feels disproportionately high for the effective serving size.
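
To put the numbers above in per-token terms, a minimal sketch: E2E latency is roughly TTFT plus (output tokens − 1) times the time per output token (TPOT). The output lengths below are hypothetical, since I haven't stated how many tokens each response generates.

```python
def implied_tpot_ms(e2e_s: float, ttft_s: float, output_tokens: int) -> float:
    """Time per output token implied by E2E = TTFT + (n - 1) * TPOT."""
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to define TPOT")
    return (e2e_s - ttft_s) / (output_tokens - 1) * 1e3


# Midpoints of the observed ranges: ~4 s E2E, ~0.2 s TTFT.
# The output lengths are assumptions for illustration only.
for n_tokens in (64, 256, 1024):
    tpot = implied_tpot_ms(e2e_s=4.0, ttft_s=0.2, output_tokens=n_tokens)
    print(f"{n_tokens:>5} output tokens -> ~{tpot:.1f} ms per token")
```

Whether 3–5 seconds is "too slow" depends almost entirely on the output length: the same E2E number is unremarkable for a long response and poor for a short one.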

I already experimented with vLLM’s n-gram speculative decoding, but honestly didn’t see meaningful gains.

Now I’m considering more serious speculative decoding approaches:

  • EAGLE / Medusa-style methods
  • Draft-model-based speculative decoding
  • Possibly training a smaller Gemma draft model
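
Before investing in any of these, a rough estimate of the payoff may help. The sketch below uses the standard simplified model of speculative decoding: each drafted token is accepted independently with probability `alpha`, the draft proposes `k` tokens per step, and one draft forward pass costs `cost_ratio` times a target forward pass. All three values are assumptions you would have to measure for your own model and workload, and the model ignores verification overhead beyond one target pass.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target verification step
    under an i.i.d. per-token acceptance rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, k: int, cost_ratio: float) -> float:
    """Expected wall-clock speedup over plain autoregressive decoding.
    Each step costs k draft passes plus one target pass."""
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


# Hypothetical values: draft costs 10% of the target, 4 drafted tokens.
for alpha in (0.3, 0.6, 0.8):
    print(f"alpha={alpha}: ~{expected_speedup(alpha, 4, 0.1):.2f}x")
```

With a low acceptance rate the estimated speedup drops below 1x, which would be consistent with n-gram drafting not helping on outputs that rarely repeat the prompt.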

Curious to hear from others who’ve worked with Gemma 4 or large distilled/fine-tuned models:

  • Is this kind of latency expected?
  • What actually moved the needle for you?
  • Any bottlenecks I should investigate first before going deeper into speculative decoding?
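
For anyone comparing numbers, here is a minimal sketch of how per-request metrics could be computed from token arrival timestamps on a streaming response. The function and field names are hypothetical; the timestamps would come from your own client (for example, `time.perf_counter()` recorded as each streamed chunk arrives).

```python
from statistics import mean


def summarize_request(start: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, E2E, and inter-token latency from arrival timestamps.

    start: time the request was sent (seconds).
    token_times: arrival time of each streamed token, in order (seconds).
    """
    if not token_times:
        raise ValueError("no tokens received")
    # Gaps between consecutive tokens; empty if only one token arrived.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": token_times[0] - start,
        "e2e_s": token_times[-1] - start,
        "mean_itl_s": mean(gaps) if gaps else 0.0,
        "max_itl_s": max(gaps) if gaps else 0.0,
    }


# Made-up timestamps purely for illustration.
stats = summarize_request(0.0, [0.2, 0.3, 0.4, 0.6])
print(stats)
```

A large gap between mean and max inter-token latency would point toward stalls (scheduling, preemption, other requests in the batch) rather than steady per-token cost.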

Would love to hear experiences, benchmarks, or even horror stories :))

submitted by /u/Ok-Rooster-8120