r/LocalLLaMA · 1 min read

Question: llama.cpp, what's good right now for MTP, KV cache quantization, and long context?


I used the vLLM version of https://github.com/noonghunna/club-3090

It worked fine for maybe 20-40k context; I haven't tried the new one. Has anyone used the new patched llama.cpp version on a single 3090? The project is starting to seem very bloated, at least README-wise.

I use https://github.com/Indras-Mirror/llama.cpp-mtp and get 60 t/s with long context.

On mainline llama.cpp with a q4 cache I also get 60 t/s, but it quickly drops to 20 t/s as the context fills up.

Are there any better options, and what is your experience?

EDIT: Using Qwen 3.6 27B at Q4.

EDIT: I use MTP on mainline as described above; context is max 4k at good speed with a Q4 cache.
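
For anyone sizing this up, here is a rough back-of-the-envelope sketch of how KV cache memory scales with context length and cache type. It is not taken from either project above. It assumes a plain full-attention transformer, and the layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen 3.6 27B values (check the GGUF metadata for those). The bytes-per-element figures follow ggml's block layouts (q4_0 stores 32 values in 18 bytes, q8_0 stores 32 values in 34 bytes).

```python
# Approximate bytes per cached element for common llama.cpp KV cache types.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, cache_type):
    """Estimate KV cache size in bytes for a plain full-attention model."""
    # Each token stores one K and one V vector per layer and per KV head.
    elements_per_token = 2 * n_layers * n_kv_heads * head_dim
    return n_ctx * elements_per_token * BYTES_PER_ELEMENT[cache_type]


# HYPOTHETICAL dimensions for illustration only (not Qwen 3.6 27B's real config).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128


def report(n_ctx):
    """Return estimated cache size in GiB for each cache type."""
    return {
        cache_type: kv_cache_bytes(n_ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, cache_type) / 2**30
        for cache_type in BYTES_PER_ELEMENT
    }


if __name__ == "__main__":
    for n_ctx in (4096, 40960):
        sizes = ", ".join(f"{t}: {gib:.2f} GiB" for t, gib in report(n_ctx).items())
        print(f"ctx={n_ctx}: {sizes}")
```

This only covers memory, not speed. Under these assumptions a q4_0 cache is roughly 3.5x smaller than f16, which is what makes longer contexts fit on a single 24 GB card. It says nothing about why throughput drops as the context fills.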

submitted by /u/GodComplecs