r/LocalLLaMA · June 21, 2026 · 4 min read

Best local model for vision - 2nd benchmark update - 21 Jun 2026

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

I previously posted the first results of my VLM benchmark. There were a few useful comments and observations I took into account, to revise and expand my benchmark:

I initially did not take into account the Gemma 4 vision budget which defaults to 280, essentially making it useless. I have increased it to maximum level, with the following optimal setttings which were posted here recently: --image-min-tokens 560 --image-max-tokens 2240
I used the -b 4096 -ub 4096 parameters to avoid splitting the image tokens into multiple blocks (default value is 512)
Switched from ollama to llama.cpp
I expanded my dataset from 20 to 30 images, to cover more use cases
I expanded the benchmark to test the impact of thinking vs non-thinking
The first benchmark only included Q4 quants; I expanded it to Q8 quants for small models
The first benchmark only tested each image once; now 3x tests per image

In total, 23 models x 30 images x 3 tests = 2,070 tests (not including failures, tunings, re-runs), 60 to 70 inference hours.

I have three recommendations this time, one per hardware tier:

VRAM tier	Pick	Size	Score	Speed
4–8 GB	Qwen3.5 4B (nothink) @ Q4	3.2 GB	75.5/100	20 s/img
12–16 GB	Qwen3-VL 8B @ Q8 (not Q4)	8.1 GB	74.4/100	26 s/img
24+ GB	Qwen3.6 27B (nothink) @ Q4	16.9 GB	79.6/100	70 s/img

I noticed a few interesting outcomes, which I did not expect:

Thinking mode hurts vision. Every Qwen hybrid thinker scored higher with enable_thinking=false. This is because vision is perception, not reasoning. Thinking adds instability, timeouts, and empty outputs.

MoE size is misleading for vision. MoE models tie with much smaller dense models, and perform worse than equivalent dense models. It makes sense in retrospect if when you see that a MoE is a collection of small models. Their big total parameter count buys knowledge breadth, not perception depth which scales with density.

Q8 is not a guaranteed improvement. It improves Gemma 4 (more consistent, less hallucinations), cripples Qwen hybrid thinkers (they spend too long thinking, resulting in frequent timeouts). The only Q8 that's a strict win is Qwen3-VL 8B-Q8.

Here are the full quality ranking, sorted by effective score (raw × completion rate). σ = stability across 3 runs.

#	Variant	Quant	Mode	Score	σ	Successful	Note
1	Qwen3.6 27B	Q4	nothink	79.6	0.24	90/90	Champion
2	Qwen3.6 27B	Q4	think	78.2	0.26	81/90	Same model, slower
3	Qwen3.6 35B-A3B	Q4	nothink	76.4	0.55	90/90	MoE
4	Qwen3.5 4B	Q4	nothink	75.5	0.48	90/90	Best pts/GB
5	GLM-4.6V-Flash 9B	Q4	—	75.1	0.53	90/90	Best for chinese OCR
6	Qwen3.6 35B-A3B	Q4	think	75.0	0.31	90/90	MoE
7	Gemma 4 31B	Q4	—	74.6	0.45	90/90	Slow (93 s)
8	Qwen3-VL 8B	Q8	—	74.4	0.33	90/90	Only perfect Q8
9	Qwen3-VL 8B	Q4	—	73.1	0.52	90/90
10	Qwen3.5 9B	Q4	nothink	73.1	0.58	90/90
11	Gemma 4 26B-A4B	Q4	—	72.7	0.51	90/90
12	Qwen3.5 9B	Q4	think	72.7	0.52	90/90
13	GLM-9B	Q8	—	73.4 raw / 68.5 eff	0.51	84/90	Drop vs Q4
14	Qwen3.5 4B	Q4	think	70.6	0.77	90/90	Unstable
15	Qwen3-VL 4B	Q4	—	65.9	0.76	90/90	Degenerates
16	Qwen3.5 4B	Q8	nothink	65.7	0.51	partial	Drop vs Q4
17	Qwen3-VL 4B	Q8	—	65.3	1.03	87/93	Worst σ
18	Gemma 4 12B	Q8	—	76.6 raw / 59.7 eff	0.28	74/95	22% timeouts
19	Gemma 4 12B	Q4	—	64.1	0.66	90/90	Hallucinations
20	Gemma 4 E4B	Q8	—	63.9	0.46	78/90
21	Gemma 4 E4B	Q4	—	58.8	0.60	90/90	Wrong counts
22	Qwen3.5 9B	Q8	nothink	partial	—	~85% fail	Unusable
23	Qwen3.5 9B	Q8	think	partial	—	~60% fail	Unusable

Here is bit more info about some of those models, that the above numbers cannot express, based on reading their actual output:

Qwen3.6-27B (Q4=16.9GB) : Best quality, best stability, no failures with thinking disabled. The no-thinking mode has a huge beneficial on speed, and avoids the timeouts due to reasoning too long. Gives very direct answers.

Qwen3.6-35B-A3B (Q4=21.9GB) : Based on the numbers it might appear like a good speedy alternatives, but it rarely performs better than smaller models. Biggest problem, apart from its size, is the huge variance and unpredictability of its responses. Skip it, not worth using MoE for vision.

Qwen3-VL-8B-Instruct (Q4=5.8GB Q8=8.1GB) : The only model with 100% reliability on Q8. Q8 brings big over Q4, for both quality and consistency.

Qwen3.5-4B (Q4=3.2GB) : Use with thinking disabled; when enabled, on dense images, it can easily exhaust its token budget and error, or timeout. Q8 was a lot worse than Q4, with again timeouts on dense images. None of those problems with Q4 non-thinking.

Test methodology

specs: Apple M2 Max, 96GB RAM
runtime: llama.cpp b9690 via llama-server
models: 11 base models, Q4_K_M; Q8_0 added for 7 of the smaller ones
hybrid thinking models (Qwen3.5/3.6) tested both with and without thinking enabled
30 images across screenshots, photos, posters, art, medical, scientific graphs, dense scenes, and multilingual content
3 runs per (model × image), median run scored
hybrid scoring: 40% deterministic probes (OCR, counts, hallucination checks) + 60% LLM judge based on human created detailed ground truth description for each image
timeout: 300s per call (fail fast on runaway thinking)

submitted by /u/ex-arman68
[link] [comments]

Discussion (0)

No comments yet. Sign in and be the first to say something.

I have three recommendations this time, one per hardware tier:

Here are the full quality ranking, sorted by effective score (raw × completion rate). σ = stability across 3 runs.

Test methodology

Discussion (0)

More from r/LocalLLaMA