r/LocalLLaMA

Benchmarks of 20 small LLMs on a 6GB RTX 4050

I'm looking for models that can run on my GPU and actually do something useful. Because these models are all so small, I think even a small difference between them can be a "big" improvement.

So I went through the LM Studio model catalog and searched for multiple variants within the same families, trying to select the newer models. Then I asked Claude to select known benchmarks and run some qualitative tests.

Next I'll test them on real use cases and then select a "team".

Most people who run models locally use more powerful machines, but the majority of people barely have a 6GB GPU, so this review may help them.

The report follows:

The problem. I want local models doing repetitive overnight work (file organization, tagging, log triage) on a 6GB laptop GPU — zero cost, private, no rate limits. The real question isn't "which model is best" but "which of these specific quants actually fit in 6GB and behave correctly on my tasks." Leaderboard scores don't answer that: they're run on full-precision weights and generic benchmarks, not the Q4/Q6 GGUF you'll actually load.

Why qualitative probing instead of full benchmark suites. Running BFCL-v3/v4 + IFEval + MMLU across 20 models on one 6GB GPU is on the order of days-to-a-week of compute, and most of that signal is already published per model family. What's not published is how a given quant behaves on the exact behaviors I need. So I built a fixed 6-probe set targeting those behaviors — (1) parseable tool-call, (2) multi-turn tool-call (does it chain with the real tool result or hallucinate a placeholder), (3) strict JSON, (4) instruction adherence (IFEval-style), (5) plan decomposition, (6) no path hallucination, plus a GSM8K-style arithmetic check — judged the outputs directly, and triangulated against published BFCL/IFEval to catch quant-level regressions. That turns a week into ~1 hour and tests the thing that actually matters. Then a separate performance pass measured prefill (prompt-processing) speed and generation tok/s at 1k/8k/32k context, N=5 each, on LM Studio's OpenAI-compatible API.
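
As an illustration of how three of the probes (strict JSON, parseable tool-call, no path hallucination) could be scored mechanically, here is a minimal Python sketch. It is not the harness used for this report, where outputs were judged by hand. It assumes replies follow the OpenAI chat-completions message shape (`tool_calls[].function.name` plus a JSON-string `arguments`); the function names, the tool names and the Unix-style path regex are all hypothetical.

```python
import json
import re


def check_strict_json(output: str, required_keys: set) -> bool:
    """Pass only if the whole reply is a single JSON object with the required keys.

    Markdown fences or chatty preambles around the JSON count as a failure.
    """
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, dict) and required_keys.issubset(data.keys())


def check_tool_call(message: dict, allowed_tools: set) -> bool:
    """Pass if every tool call names a known tool and carries JSON-object arguments."""
    calls = message.get("tool_calls") or []
    if not calls:
        return False
    for call in calls:
        fn = call.get("function", {})
        if fn.get("name") not in allowed_tools:
            return False  # hallucinated function name
        try:
            args = json.loads(fn.get("arguments", ""))
        except (json.JSONDecodeError, TypeError):
            return False  # arguments are not parseable JSON
        if not isinstance(args, dict):
            return False
    return True


def hallucinated_paths(output: str, known_paths: set) -> set:
    """Return absolute Unix-style paths in the reply that were never given to the model."""
    found = {p.rstrip(".") for p in re.findall(r"(?:/[\w.\-]+)+", output)}
    return found - known_paths
```

A probe then passes when `check_strict_json` or `check_tool_call` returns `True`, or when `hallucinated_paths` returns an empty set. Checks like these only catch format failures; whether a plan is complete or sensible still needs a human read.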

The 20 models. Granite 4.1 3B (lmstudio-community, unsloth, nikolaykozloff Q6/Q8) · Granite-3B-function-calling-xLAM (Salesforce/unsloth) · Granite-3B-sft-claude-opus-reasoning · Granite 4.1 8B · Granite 4.1 8B base · LFM2.5-8B-A1B (liquidai official, unsloth, RemySkye-i1) · Gemma-4-e2b (google base, agentic, ×opus-4.7-turbo, ×deepseek-v4) · LFM2.5-1.2B-Instruct · LFM2.5-VL-1.6B (liquidai, unsloth) · Qwen3.5-4B (base, claude-4.6-opus-reasoning-distilled) · Nemotron-3-Nano-4B.

Results (gen tok/s, N=5, σ<2.5 throughout; VRAM = full-GPU load):

| Model | VRAM | @1k | @8k | @32k | Max ctx (GPU) | Note |
|---|---|---|---|---|---|---|
| lfm2.5-1.2b-instruct | 1.9G | 129 | 118 | 102 | 256k | clean, fast |
| unsloth/lfm2.5-vl-1.6b | 3.0G | 207 | 182 | 142 | 128k | fastest overall (vision) |
| liquidai/lfm2.5-vl-1.6b | 2.7G | 128 | 115 | 100 | 256k | vision |
| liquidai/lfm2.5-8b-a1b | 5.4G | 99 | 97 | 90 | 64k | MoE, holds 32k well |
| unsloth/lfm2.5-8b-a1b | 4.6G | 121 | 112 | 102 | 128k | fast but drops files |
| lfm2.5-8b-a1b-i1 | 5.4G | 108 | 99 | 95 | 32k | reasoning variant |
| gemma-4-agentic-e2b | 2.4G | 82 | 78 | 70 | 256k | lightest, holds 32k |
| google/gemma-4-e2b (base) | 3.6G | 78 | 79 | 69 | 256k | base, noisy |
| gemma-4-e2b×opus-turbo | 2.4G | 82 | 78 | 71 | 256k | broken chat template |
| gemma4-e2b-deepseek | 2.4G | 83 | 78 | 71 | 256k | hallucinated paths |
| unsloth/granite-4.1-3b | 4.8G | 70 | 60 | 40 | 32k | quality ≈ 8B |
| granite-3b-xLAM (fc) | 4.8G | 71 | 61 | 41 | 32k | no edge vs base 3B |
| lmstudio/granite-4.1-3b | 4.9G | 66 | 59 | 42 | 32k | solid baseline |
| granite-3b-sft-reasoning | 4.8G | 68 | 58 | 39 | 32k | reasoning tax |
| nikolaykozloff/granite-3b | 4.6G | 45 | 40 | - | 24k | hallucinated fn name |
| nvidia/nemotron-3-nano-4b | 3.7G | 58 | 56 | 48 | 128k | least ctx-degradation |
| qwen3.5-4b-distilled | 5.1G | 52 | 50 | 43 | 32k | reasoning, verbose |
| qwen3.5-4b (base) | 5.6G | 52 | 49 | 43 | 32k | fine, unremarkable |
| granite-4.1-8b base | 5.1G | 38 | 32 | - | ~10k | base, hallucinates |
| granite-4.1-8b | 5.6G | 28 | 25 | - | ~10k | slow + ctx-capped |
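
The context cost of any row can be computed from its @1k and @32k columns. The sketch below does that for five rows whose numbers are copied from the results; the helper names `slowdown_pct` and `rank_by_slowdown` are made up for illustration, and the percentage is simply the relative drop in gen tok/s.

```python
# (gen tok/s at 1k context, gen tok/s at 32k context), copied from the results.
ROWS = {
    "lfm2.5-1.2b-instruct": (129, 102),
    "liquidai/lfm2.5-8b-a1b": (99, 90),
    "gemma-4-agentic-e2b": (82, 70),
    "unsloth/granite-4.1-3b": (70, 40),
    "nvidia/nemotron-3-nano-4b": (58, 48),
}


def slowdown_pct(tps_1k: float, tps_32k: float) -> float:
    """Relative loss in generation speed going from 1k to 32k context, in percent."""
    return round(100 * (tps_1k - tps_32k) / tps_1k, 1)


def rank_by_slowdown(rows: dict) -> list:
    """Sort models from smallest to largest relative slowdown."""
    ranked = [(name, slowdown_pct(a, b)) for name, (a, b) in rows.items()]
    return sorted(ranked, key=lambda item: item[1])


if __name__ == "__main__":
    for name, pct in rank_by_slowdown(ROWS):
        print(f"{name:28s} {pct:5.1f}% slower at 32k")
```

Rows without a 32k measurement are left out, since their context ceiling sits below 32k.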

Three cross-cutting findings: (a) Reasoning-tuned models cost more; they don't fail. With a tight token cap they look broken (truncated mid-thought), but given room they answer correctly at 2–3× the tokens. That's a latency/cost signal for batch work, not a quality reason to cut them (though two still dropped a file in open-ended decomposition even with budget). (b) Third-party fine-tunes are a landmine: hallucinated function names, a broken Jinja chat template (dead on arrival for multi-turn tool calls), hallucinated paths. The base/official-instruct builds were consistently safer. (c) Context tax is real but not uniform: going by the results table, every model that reaches 32k loses gen speed from 1k→32k, from about 9% for liquidai/lfm2.5-8b-a1b, through roughly 12–31% for most models, to 36–43% for the Granite 3Bs, with no thermal throttling across N=5.
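
For finding (a), one way to keep "ran out of tokens" separate from "answered wrong" is to look at the `finish_reason` field, which OpenAI-style chat completion responses set to `"length"` when generation stops at the token cap. The sketch below assumes the server returns that response shape, including a `usage.completion_tokens` count; `classify_reply`, `token_overhead` and the correctness callback are hypothetical names, not part of any API.

```python
def classify_reply(response: dict, is_correct) -> str:
    """Label a chat completion as 'truncated', 'pass' or 'fail'.

    A reply cut off by the token cap is reported as truncated rather than
    failed, so it can be rerun with a larger budget before being judged.
    """
    choice = response["choices"][0]
    text = choice["message"].get("content") or ""
    if choice.get("finish_reason") == "length":
        return "truncated"
    return "pass" if is_correct(text) else "fail"


def token_overhead(reasoning_usage: dict, baseline_usage: dict) -> float:
    """How many times more completion tokens the reasoning model spent."""
    return reasoning_usage["completion_tokens"] / baseline_usage["completion_tokens"]
```

Counting `truncated` separately from `fail` keeps a tight cap from being read as a quality problem, and `token_overhead` puts a number on the extra cost when the answer does arrive.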

The picks.

LFM2.5-1.2B-Instruct — the cheap, always-on model. 1.9GB VRAM, 1.5s load, 129 tok/s and clean on JSON / instruction-adherence / no-hallucination probes. Its weak planning is irrelevant for a low-stakes always-resident role. Highest prefill in the whole set (~8.5k tok/s at 8k), so it ingests short inputs near-instantly.

Granite-4.1-3B (instruct) — the quality-per-VRAM baseline. On my probes it matched Granite-8B on output quality while running 2–3× faster (60 tok/s at 8k vs 25), and it's the only dense 3B that cleanly holds 32k context. Notably the "function-calling-xLAM" fine-tune showed no advantage once tested multi-turn — the single-turn impression that it chained tools better collapsed under a proper multi-turn probe. Use the plain instruct.

Gemma-4-agentic-e2b — the surprise. Just 2.4GB VRAM (lightest non-trivial model here), holds 256k context, and sustains 70 tok/s at 32k with high prefill (~3.8k tok/s). It gave clean, complete decomposition plans. It's the one model flexible enough to act as either a light orchestrator or a fast worker, which matters when you're juggling roles in 6GB.

Nemotron-3-Nano-4B — the long-context worker. Slower at small context (58 tok/s at 1k) but it degrades the least — still 48 tok/s at 32k where the Granite-3Bs fall to ~41 — at only 3.7GB and a 128k ceiling. Best choice when the worker has to read a large input in one shot.

LFM2.5-8B-A1B (liquidai) — the orchestrator, and the headline result. This 8B/1B-active MoE does 90 tok/s at 32k context for ~5.4GB. The obvious dense alternative, Granite-8B, does 25–28 tok/s and caps out around 10k context for the same VRAM — so the MoE is 3–4× faster with 3× the usable context. I tested the unsloth build too (faster at 102 tok/s and 128k context) but it dropped a file in open-ended decomposition even with a generous token budget, so the official liquidai build wins on completeness; unsloth stays as a speed fallback.
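
When it comes to juggling these picks in 6GB, a quick check of which ones could stay loaded together is sketched below. It rests on loud assumptions: that the VRAM figures from the results simply add up when two models are co-resident, that the lmstudio build stands in for the Granite-4.1-3B pick, and that a flat 6.0GB is fully usable. None of this was measured, and `teams_that_fit` is a hypothetical helper.

```python
from itertools import combinations

# VRAM at full-GPU load, in GB, copied from the results.
PICKS_GB = {
    "lfm2.5-1.2b-instruct": 1.9,
    "gemma-4-agentic-e2b": 2.4,
    "nvidia/nemotron-3-nano-4b": 3.7,
    "lmstudio/granite-4.1-3b": 4.9,
    "liquidai/lfm2.5-8b-a1b": 5.4,
}


def teams_that_fit(models: dict, budget_gb: float = 6.0, size: int = 2) -> list:
    """List every combination of `size` models whose summed VRAM fits the budget.

    Assumes footprints are additive, which ignores any extra KV cache
    needed when running at longer contexts than the measured load.
    """
    fits = []
    for combo in combinations(sorted(models), size):
        total = sum(models[name] for name in combo)
        if total <= budget_gb:
            fits.append((combo, round(total, 1)))
    return fits


if __name__ == "__main__":
    for combo, total in teams_that_fit(PICKS_GB):
        print(f"{total:.1f} GB: {' + '.join(combo)}")
```

Under these assumptions only the 1.2B model pairs with anything else; the 5.4GB orchestrator would have to be swapped in and out rather than kept resident next to a worker.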

Takeaways. Benchmark on your own quants and your own tasks — published scores won't catch a broken chat template or a quant that hallucinates function names. On VRAM-constrained hardware an MoE punches far above its parameter count. And a tight, targeted probe set judged by hand gets you a defensible decision in an hour instead of a week of GPU time.

submitted by /u/drfritz2