r/LocalLLaMA · June 28, 2026 · 1 min read

How many of you use Q1 or Q2 quants of big models (100-250B)? How is it?

#model-release #gpu

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.


Sharing some popular (and recent) models for reference:

151-250B :

DeepSeek-V4-Flash
Step-3.X-Flash
Command-a-plus-05-2026
Laguna-M.1
MiniMax-M2.X
Qwen3-235B-A22B

100-150B :

GLM-4.5-Air
Qwen3.5-122B-A10B
NVIDIA-Nemotron-3-Super-120B-A12B
Mistral-Small-4-119B-2603
Devstral-2-123B-Instruct-2512
Mistral-Medium-3.5-128B
Llama-4-Scout-17B-16E-Instruct (Yay! got your attention)

<100B :

Llama-3.3-70B-Instruct
Qwen3-Coder-Next
Qwen3-Next-80B-A3B

I see that some people use Q3 (even down to IQ3_XXS) whenever they can't run Q4 on their rig. For example, I noticed that some DGX/SH users run Q3 of MiniMax-M2 models because Q4 is such a tight fit.

I guess Q1/Q2 won't be good for small/medium models (~40B), at least at the agentic coding level. Chatting would probably be semi-usable quality-wise, though I'm not sure.

But I believe it's the total opposite for big models because of their sheer size. So how many of you use Q1 or Q2 of big models (100-250B)? How is it, and is it good enough for you right now? Please share your feedback on agentic coding, writing, and chatting with such quants of the models above. Also, please let us know what issues you are facing with Q1/Q2 quants, e.g. looping, repetition, tool-calling issues, etc.

Personally, I don't go below Q4 for small/medium models, even though I have only 8GB VRAM on my current laptop. My upcoming rig comes with 96GB VRAM + 128GB RAM, which is why I posted this thread. I'm thinking of trying Q1/Q2 of models like NVIDIA-Nemotron-3-Ultra-550B-A55B, GLM-5.X, etc.
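For anyone who wants to sanity-check whether a given quant could even load, here is a rough back-of-the-envelope sketch: weight size is roughly parameter count times bits per weight, divided by 8. The bits-per-weight figures below are approximate nominal values I'm assuming for those quant types (real GGUF files mix tensor types and usually come out somewhat larger), and the 16 GB allowance for KV cache and runtime buffers is purely a guess, so treat the result as a lower bound, not a promise.

```python
# Rough weight-only size estimate: params * bits_per_weight / 8.
# ASSUMPTION: these bpw values are approximate nominal figures, not
# measured file sizes; actual GGUF quants are often a bit larger.
APPROX_BPW = {
    "Q4_K_M": 4.85,
    "IQ3_XXS": 3.06,
    "IQ2_XXS": 2.06,
    "IQ1_S": 1.56,
}


def weights_gb(params_billions: float, bpw: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return params_billions * bpw / 8


def fits(params_billions: float, bpw: float, budget_gb: float,
         overhead_gb: float = 16.0) -> bool:
    """True if weights plus a guessed overhead fit in the memory budget.

    ASSUMPTION: overhead_gb (KV cache, buffers, OS) is a placeholder;
    it depends heavily on context length and backend.
    """
    return weights_gb(params_billions, bpw) + overhead_gb <= budget_gb


if __name__ == "__main__":
    budget = 96 + 128  # GB: VRAM + RAM of the rig mentioned above
    params = 550       # billions, from the 550B model name above
    for name, bpw in APPROX_BPW.items():
        size = weights_gb(params, bpw)
        verdict = "fits" if fits(params, bpw, budget) else "too big"
        print(f"{name:8s} ~{size:6.1f} GB -> {verdict}")
```

Under these assumptions, a 550B model only starts to fit in 96GB + 128GB at around 2 bits per weight, which is why Q1/Q2 is the only realistic option for that size class on such a rig.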

submitted by /u/pmttyji
