Quick note on sudden performance loss when running GGUFs
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I had a couple of GGUFs (Qwen3.5-35B-A3B-APEX-I-Quality and an Unsloth model) that suddenly started showing erratic performance: token generation would abruptly drop from 20+ tokens/s to 5 tokens/s. It turned out both files had been damaged, quite possibly while I was manually embedding MTP layers (which, logically, shouldn't touch the source model at all). I found this by running sha256sum on the files and seeing that the hashes no longer matched the expected values. I redownloaded the models and everything was back to normal.
TL;DR: if performance gets iffy, check that the model file's sha256sum still matches the expected hash.
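For anyone who prefers to script the check instead of running sha256sum by hand, here is a minimal sketch in Python. The function names `sha256_of_file` and `matches_expected` are made up for illustration, and the expected hash is assumed to be whatever the model's publisher lists for the file.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file.

    The file is read in chunks so a multi-gigabyte GGUF
    does not have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare the file's digest with an expected hex string (hypothetical helper)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Usage would look like `matches_expected("model.gguf", "<hash from the download page>")`, where both the file name and the hash are placeholders. A result of `False` means the local file differs from the one that was published, and redownloading is the simplest fix.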