Benchmarks of 20 small LLMs on a 6GB RTX 4050
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I'm looking for models that can run on my GPU and actually do something useful. I think that any small difference could be a "big" improvement, because they are all so small.
So I went to the LM studio database and searched many variants from the same family, trying to select the newer models. Then asked claude to select known benchmarks and then run some qualitative tests.
Now I'll try to test with real use cases and then select a "team".
Most of the people runs local with more powerful machines. But the majority of the people barely has a 6gb gpu. So this review may help them.
Below goes the report:
The problem. I want local models doing repetitive overnight work (file organization, tagging, log triage) on a 6GB laptop GPU — zero cost, private, no rate limits. The real question isn't "which model is best" but "which of these specific quants actually fit in 6GB and behave correctly on my tasks." Leaderboard scores don't answer that: they're run on full-precision weights and generic benchmarks, not the Q4/Q6 GGUF you'll actually load.
Why qualitative probing instead of full benchmark suites. Running BFCL-v3/v4 + IFEval + MMLU across 20 models on one 6GB GPU is on the order of days-to-a-week of compute, and most of that signal is already published per model family. What's not published is how a given quant behaves on the exact behaviors I need. So I built a fixed 6-probe set targeting those behaviors — (1) parseable tool-call, (2) multi-turn tool-call (does it chain with the real tool result or hallucinate a placeholder), (3) strict JSON, (4) instruction adherence (IFEval-style), (5) plan decomposition, (6) no path hallucination, plus a GSM8K-style arithmetic check — judged the outputs directly, and triangulated against published BFCL/IFEval to catch quant-level regressions. That turns a week into ~1 hour and tests the thing that actually matters. Then a separate performance pass measured prefill (prompt-processing) speed and generation tok/s at 1k/8k/32k context, N=5 each, on LM Studio's OpenAI-compatible API.
The 20 models. Granite 4.1 3B (lmstudio-community, unsloth, nikolaykozloff Q6/Q8) · Granite-3B-function-calling-xLAM (Salesforce/unsloth) · Granite-3B-sft-claude-opus-reasoning · Granite 4.1 8B · Granite 4.1 8B base · LFM2.5-8B-A1B (liquidai official, unsloth, RemySkye-i1) · Gemma-4-e2b (google base, agentic, ×opus-4.7-turbo, ×deepseek-v4) · LFM2.5-1.2B-Instruct · LFM2.5-VL-1.6B (liquidai, unsloth) · Qwen3.5-4B (base, claude-4.6-opus-reasoning-distilled) · Nemotron-3-Nano-4B.
Results (gen tok/s, N=5, σ<2.5 throughout; VRAM = full-GPU load):
| Model | VRAM | @1k | @8k | @32k | Max ctx (GPU) | Note |
|---|---|---|---|---|---|---|
| lfm2.5-1.2b-instruct | 1.9G | 129 | 118 | 102 | 256k | clean, fast |
| unsloth/lfm2.5-vl-1.6b | 3.0G | 207 | 182 | 142 | 128k | fastest overall (vision) |
| liquidai/lfm2.5-vl-1.6b | 2.7G | 128 | 115 | 100 | 256k | vision |
| liquidai/lfm2.5-8b-a1b | 5.4G | 99 | 97 | 90 | 64k | MoE, holds 32k well |
| unsloth/lfm2.5-8b-a1b | 4.6G | 121 | 112 | 102 | 128k | fast but drops files |
| lfm2.5-8b-a1b-i1 | 5.4G | 108 | 99 | 95 | 32k | reasoning variant |
| gemma-4-agentic-e2b | 2.4G | 82 | 78 | 70 | 256k | lightest, holds 32k |
| google/gemma-4-e2b (base) | 3.6G | 78 | 79 | 69 | 256k | base, noisy |
| gemma-4-e2b×opus-turbo | 2.4G | 82 | 78 | 71 | 256k | broken chat template |
| gemma4-e2b-deepseek | 2.4G | 83 | 78 | 71 | 256k | hallucinated paths |
| unsloth/granite-4.1-3b | 4.8G | 70 | 60 | 40 | 32k | quality ≈ 8B |
| granite-3b-xLAM (fc) | 4.8G | 71 | 61 | 41 | 32k | no edge vs base 3B |
| lmstudio/granite-4.1-3b | 4.9G | 66 | 59 | 42 | 32k | solid baseline |
| granite-3b-sft-reasoning | 4.8G | 68 | 58 | 39 | 32k | reasoning tax |
| nikolaykozloff/granite-3b | 4.6G | 45 | 40 | — | 24k | hallucinated fn name |
| nvidia/nemotron-3-nano-4b | 3.7G | 58 | 56 | 48 | 128k | least ctx-degradation |
| qwen3.5-4b-distilled | 5.1G | 52 | 50 | 43 | 32k | reasoning, verbose |
| qwen3.5-4b (base) | 5.6G | 52 | 49 | 43 | 32k | fine, unremarkable |
| granite-4.1-8b base | 5.1G | 38 | 32 | — | ~10k | base, hallucinates |
| granite-4.1-8b | 5.6G | 28 | 25 | — | ~10k | slow + ctx-capped |
Three cross-cutting findings: (a) reasoning-tuned models cost, they don't fail — with a tight token cap they look broken (truncated mid-thought), but given room they answer correctly at 2–3× the tokens; that's a latency/cost signal for batch work, not a quality reason to cut them (though two still dropped a file in open-ended decomposition even with budget). (b) Third-party fine-tunes are a landmine — hallucinated function names, a broken jinja chat template (dead on arrival for multi-turn tool calls), hallucinated paths; the base/official-instruct builds were consistently safer. (c) Context tax is real and uniform — every model loses ~20–35% gen speed from 1k→32k, with no thermal throttling across N=5.
The picks.
LFM2.5-1.2B-Instruct — the cheap, always-on model. 1.9GB VRAM, 1.5s load, 129 tok/s and clean on JSON / instruction-adherence / no-hallucination probes. Its weak planning is irrelevant for a low-stakes always-resident role. Highest prefill in the whole set (~8.5k tok/s at 8k), so it ingests short inputs near-instantly.
Granite-4.1-3B (instruct) — the quality-per-VRAM baseline. On my probes it matched Granite-8B on output quality while running 2–3× faster (60 tok/s at 8k vs 25), and it's the only dense 3B that cleanly holds 32k context. Notably the "function-calling-xLAM" fine-tune showed no advantage once tested multi-turn — the single-turn impression that it chained tools better collapsed under a proper multi-turn probe. Use the plain instruct.
Gemma-4-agentic-e2b — the surprise. Just 2.4GB VRAM (lightest non-trivial model here), holds 256k context, and sustains 70 tok/s at 32k with high prefill (~3.8k tok/s). It gave clean, complete decomposition plans. It's the one model flexible enough to act as either a light orchestrator or a fast worker, which matters when you're juggling roles in 6GB.
Nemotron-3-Nano-4B — the long-context worker. Slower at small context (58 tok/s at 1k) but it degrades the least — still 48 tok/s at 32k where the Granite-3Bs fall to ~41 — at only 3.7GB and a 128k ceiling. Best choice when the worker has to read a large input in one shot.
LFM2.5-8B-A1B (liquidai) — the orchestrator, and the headline result. This 8B/1B-active MoE does 90 tok/s at 32k context for ~5.4GB. The obvious dense alternative, Granite-8B, does 25–28 tok/s and caps out around 10k context for the same VRAM — so the MoE is 3–4× faster with 3× the usable context. I tested the unsloth build too (faster at 102 tok/s and 128k context) but it dropped a file in open-ended decomposition even with a generous token budget, so the official liquidai build wins on completeness; unsloth stays as a speed fallback.
Takeaways. Benchmark on your own quants and your own tasks — published scores won't catch a broken chat template or a quant that hallucinates function names. On VRAM-constrained hardware an MoE punches far above its parameter count. And a tight, targeted probe set judged by hand gets you a defensible decision in an hour instead of a week of GPU time.
[link] [comments]
More from r/LocalLLaMA
-
Minimax M3 appears to have no political censorship
Jun 2
-
StepFun 3.5 MTP by pwilkin · Pull Request #23274 · ggml-org/llama.cpp
Jun 2
-
I have become George Jetson: my job is now Yes/No supervision for a machine I don’t fully understand.
Jun 2
-
1-bit Bonsai Image 4B and Ternary Bonsai Image 4B Image Generation for Local Devices with just 0.93 GB and 1.21 GB respectively of Diffusion Transformer Footprint. So tiny!
Jun 2
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.