r/LocalLLaMA · · 1 min read

vLLM gives 5x the speed of llama.cpp, but quants (unsloth/GGUF) are not available. What to do?


EDIT - IGNORE. I MADE A MISTAKE.

The "better" model was the 27B dense one, not the 35B-A3B. This also shows that the 35B is not the best for coding-related tasks.

With the 27B FP8 on vLLM, prefill speed is around 1500 tokens/sec and token generation is around 25 tokens/sec. I'll need to run llama.cpp again to see how it was surprisingly faster on token generation 😄
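To put those two numbers in perspective, here is a rough sketch of how prefill and generation speed combine into per-request latency. The request sizes are hypothetical, picked only for illustration, and the model is deliberately simple: it assumes a single request with no batching, queueing, or prompt caching.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough single-request time: prefill phase plus token generation phase."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# Speeds reported above for the 27B FP8 model on vLLM.
PREFILL_TPS = 1500.0
DECODE_TPS = 25.0

# Hypothetical (prompt_tokens, output_tokens) pairs, not measured workloads.
EXAMPLE_REQUESTS = [(500, 500), (8000, 500), (30000, 200)]

for prompt_tokens, output_tokens in EXAMPLE_REQUESTS:
    total = estimate_latency_s(prompt_tokens, output_tokens, PREFILL_TPS, DECODE_TPS)
    prefill_share = (prompt_tokens / PREFILL_TPS) / total
    print(
        f"prompt={prompt_tokens:>6} output={output_tokens:>4} "
        f"total={total:6.1f}s prefill_share={prefill_share:.0%}"
    )
```

With short prompts the generation speed dominates, so a faster prefill only pays off when the prompts are long relative to the output.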

Note that the machine does not support FP8 natively. It's Ampere generation, so vLLM uses Marlin to convert.

--

Hi - I want to run an Unsloth dynamic quant on vLLM. Why?

  1. vLLM gives much faster prefill speed:

- llama.cpp - I get 800-1000 tokens/sec

- vLLM - I get 5k-10k tokens/sec

Tested with the official Qwen3.6-35B-A3B FP8. The machine is an RTX A6000 (Ampere, 48 GB).

  2. The Unsloth Q8 quant (tested on llama.cpp) gives correct pandas code, while even the official FP8 does poorly.

Why the Unsloth quant? For some reason, on my task (writing pandas code), the Unsloth 8-bit quant gives much better results than the official FP8 quant. I don't know why.

(As a side note, all the Qwen Q4 AWQ/GPTQ quants I tried give horrible results for pandas coding.)

  3. Unsloth does not publish safetensors (or any non-GGUF format) anymore.

  4. So the key question again: how do I make an Unsloth GGUF quant run on vLLM? (Or any GGUF quant, through conversion or something?) Currently vLLM gives an error saying the architecture is unsupported.

  5. I tried single-file GGUFs for both gemma4 and qwen3.6 MoE.

Thanks a lot
(Edit: deleted my old post, which did not clearly show the performance difference.)

----

EDIT - Does it matter that I had to build the llama.cpp binary myself (using opencode) after installing the CUDA toolkit, since there are no prebuilt CUDA binaries for Linux?

submitted by /u/superloser48