r/LocalLLaMA · 1 min read

How many of you use Q1 or Q2 quants of big models (100-250B)? How is it?


Sharing some popular (and recent) models for reference:

151-250B:

  • DeepSeek-V4-Flash
  • Step-3.X-Flash
  • Command-a-plus-05-2026
  • Laguna-M.1
  • MiniMax-M2.X
  • Qwen3-235B-A22B

100-150B:

  • GLM-4.5-Air
  • Qwen3.5-122B-A10B
  • NVIDIA-Nemotron-3-Super-120B-A12B
  • Mistral-Small-4-119B-2603
  • Devstral-2-123B-Instruct-2512
  • Mistral-Medium-3.5-128B
  • Llama-4-Scout-17B-16E-Instruct (Yay! got your attention)

<100B:

  • Llama-3.3-70B-Instruct
  • Qwen3-Coder-Next
  • Qwen3-Next-80B-A3B

I see that some people use Q3 (even down to IQ3_XXS) whenever they can't run Q4 on their rig. For example, I noticed that some DGX/SH users run Q3 of MiniMax-M2 models because Q4 is such a tight fit.

I guess Q1/Q2 won't be good for small/medium models (~40B), at least at the agentic coding level. For chatting, the quality would be semi-usable, I think, though I'm not sure.

But I believe it's the total opposite for big/large models because of their sheer size. So how many of you use Q1 or Q2 of big models (100-250B)? How is it, and is it good enough for you right now? Please share your feedback on agentic coding, writing, and chatting with such quants of the models above. Also, please let us know what issues you are facing with Q1/Q2 quants, e.g. looping, repetition, tool-calling problems, etc.

Personally, I don't go below Q4 for small/medium models, even though I have only 8GB VRAM on my current laptop. My upcoming rig comes with 96GB VRAM + 128GB RAM, which is why I posted this thread. I'm thinking of trying Q1/Q2 of models like NVIDIA-Nemotron-3-Ultra-550B-A55B, GLM-5.X, etc.
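
For anyone who wants to sanity-check what could fit in 96GB VRAM + 128GB RAM, here is a rough back-of-the-envelope sketch. It estimates weights-only size as parameters × bits-per-weight / 8. The bits-per-weight figures are approximate nominal values and should be treated as assumptions, since real GGUF files mix quant types per tensor and can come out noticeably larger or smaller. The `overhead_gb` allowance for KV cache, context, and the OS is a made-up placeholder, and the helper names are hypothetical.

```python
# Approximate bits per weight for a few quant types.
# ASSUMPTION: nominal figures for illustration only; actual files vary.
APPROX_BPW = {
    "IQ1_S": 1.56,
    "IQ2_XXS": 2.06,
    "IQ3_XXS": 3.06,
    "Q4_K_M": 4.85,
}


def est_size_gb(params_billion: float, bpw: float) -> float:
    """Weights-only size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bpw / 8


def fits(params_billion: float, bpw: float, budget_gb: float,
         overhead_gb: float = 16.0) -> bool:
    """True if weights plus a guessed overhead fit in the memory budget.

    ASSUMPTION: overhead_gb is a placeholder for KV cache, context and OS use.
    """
    return est_size_gb(params_billion, bpw) + overhead_gb <= budget_gb


if __name__ == "__main__":
    budget = 96 + 128  # VRAM + RAM from the post, in GB
    for name, bpw in APPROX_BPW.items():
        size = est_size_gb(550, bpw)  # 550B total parameters
        verdict = "fits" if fits(550, bpw, budget) else "does not fit"
        print(f"{name}: ~{size:.0f} GB weights, {verdict} in {budget} GB")
```

This only answers "does it load"; it says nothing about speed once layers spill from VRAM into system RAM, or about output quality at those quants, which is what the thread is asking about.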

submitted by /u/pmttyji