gemma 4 e2b quality degrades after ~30-40 continuous inferences on 4gb vram?
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
running gemma 4 e2b via llama-server for continuous background tasks on a 1650 4gb. works great initially, but after maybe 30-40 calls the outputs get noticeably worse: shorter responses, missing fields in json output, sometimes just empty. restarting llama-server fixes it immediately.
using: flash-attn on, single slot, 6144 context, ngl 15
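for reference, a rough sketch of how those settings might map onto a llama-server launch, written as a python argv builder so it can be handed to `subprocess`. the flag spellings (`-c`, `-ngl`, `-np`, `--flash-attn on`) are assumed from recent llama.cpp builds (older builds take a bare `-fa` instead), and `model.gguf` is a placeholder path, not the actual file used here.

```python
import shlex


def build_llama_server_cmd(model_path, ctx=6144, gpu_layers=15, slots=1, flash_attn=True):
    """Build an argv list for llama-server from the settings listed above.

    Flag names are assumptions based on recent llama.cpp builds;
    check `llama-server --help` for the build you are running.
    """
    cmd = [
        "llama-server",
        "-m", model_path,          # placeholder model path
        "-c", str(ctx),            # context size in tokens
        "-ngl", str(gpu_layers),   # number of layers offloaded to the GPU
        "-np", str(slots),         # number of parallel slots
    ]
    if flash_attn:
        # Recent builds accept on/off/auto; older builds use a bare -fa flag.
        cmd += ["--flash-attn", "on"]
    return cmd


if __name__ == "__main__":
    # Print the command line instead of launching anything.
    print(shlex.join(build_llama_server_cmd("model.gguf")))
```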
anyone seen this? is this a kv cache thing or just vram fragmentation over time? and is there a way to handle it without restarting the whole server?
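in case it helps frame the question, here is a minimal sketch of the workaround side: detect the symptoms described above (empty, very short, or incomplete json) and decide when a restart is due. it doesn't diagnose the cause. `REQUIRED_FIELDS`, the thresholds, and the `RestartPolicy` helper are all hypothetical names made up for illustration. `build_payload` sets `cache_prompt` to false in the `/completion` request body, which is one thing that might be worth trying to rule out prompt-cache reuse; whether it changes anything here is untested.

```python
import json

# Hypothetical field names; replace with whatever the background task expects.
REQUIRED_FIELDS = ("title", "summary")


def build_payload(prompt, n_predict=512):
    """Request body for llama-server's /completion with prompt caching disabled."""
    return {"prompt": prompt, "n_predict": n_predict, "cache_prompt": False}


def looks_degraded(text, required_fields=REQUIRED_FIELDS, min_chars=20):
    """Return True if a response is empty, too short, not JSON, or missing fields."""
    if not text or len(text.strip()) < min_chars:
        return True
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return True
    if not isinstance(data, dict):
        return True
    return any(field not in data for field in required_fields)


class RestartPolicy:
    """Signal a restart after a call budget or a streak of degraded outputs."""

    def __init__(self, max_calls=25, max_bad_streak=2):
        self.max_calls = max_calls
        self.max_bad_streak = max_bad_streak
        self.reset()

    def reset(self):
        """Call this after the server has been restarted."""
        self.calls = 0
        self.bad_streak = 0

    def record(self, text):
        """Record one response; return True when a restart looks warranted."""
        self.calls += 1
        self.bad_streak = self.bad_streak + 1 if looks_degraded(text) else 0
        return self.calls >= self.max_calls or self.bad_streak >= self.max_bad_streak
```

the idea would be to call `policy.record(response_text)` after every inference and, when it returns true, restart llama-server from the supervising script and call `policy.reset()`. that only automates the restart; it still doesn't answer whether it's kv cache or fragmentation.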