r/LocalLLaMA · · 3 min read

Qwen3.6 27B and llama.cpp appreciation post

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

To preface, here's my config:

llama-server \ --host 0.0.0.0 \ --port 1235 \ --models-preset %h/Software/models.ini \ --models-max 1 \ --sleep-idle-seconds 3600 \ --timeout 3600 \ --parallel 1 \ --device ROCm0,ROCm1 [*] flash-attn = on jinja = true fit = true ctxcp = 5 offline = true mmproj-offload = false mmap = false ; ... many other models here ... [tp-go-brrr-WORK-CODE] hf = unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q5_K_XL ctx-size = 131072 temp = 0.6 top-p = 0.95 top-k = 20 presence-penalty = 0.0 min-p = 0.00 fitt = 1024,1024,0 spec-type = draft-mtp spec-draft-n-max = 2 chat-template-kwargs = {"preserve_thinking": true} sm = tensor 

And it's been a blast with a minimal Pi config.

I've been running it on two RX 9070 XTs (PCIe 5.0 x8/x8) both powerlimited to ~235W and using it for actual work. Despite the quant being a bit too low for my liking, the speed, smarts and steerability of the result I feel like is the best of what my current setup can offer for my use cases.

I've been doing a long debugging session where I needed the model to analyze interactions between a couple of backend services deployed on 3 separate instances with different configs and avoid a networking complication while doing so.

And yet, despite some roughness showing up at 5 bit, it did all I asked it to without much issue. Given enough control over the situation, its agentic capabilities are crazy. It successfully pinpointed many vague issues down to specific lines of code by adding logging, spinning up services locally, running requests (both local and to remote instances), iterate, and successfully mocking non-important parts to make sure the actually important code stays untouched for reproducibility, all while maintaining insane responsiveness and speed for a dense model. Some examples:

prompt eval time = 845.93 ms / 337 tokens ( 2.51 ms per token, 398.38 tokens per second) eval time = 5863.80 ms / 275 tokens ( 21.32 ms per token, 46.90 tokens per second) total time = 6709.73 ms / 612 tokens draft acceptance rate = 0.83981 ( 173 accepted / 206 generated) prompt eval time = 1429.61 ms / 618 tokens ( 2.31 ms per token, 432.29 tokens per second) eval time = 3862.16 ms / 175 tokens ( 22.07 ms per token, 45.31 tokens per second) total time = 5291.77 ms / 793 tokens draft acceptance rate = 0.80597 ( 108 accepted / 134 generated) prompt eval time = 1275.30 ms / 543 tokens ( 2.35 ms per token, 425.78 tokens per second) eval time = 3287.57 ms / 151 tokens ( 21.77 ms per token, 45.93 tokens per second) total time = 4562.87 ms / 694 tokens draft acceptance rate = 0.82456 ( 94 accepted / 114 generated) prompt eval time = 318.94 ms / 45 tokens ( 7.09 ms per token, 141.09 tokens per second) eval time = 15105.91 ms / 784 tokens ( 19.27 ms per token, 51.90 tokens per second) total time = 15424.84 ms / 829 tokens draft acceptance rate = 0.98859 ( 520 accepted / 526 generated) prompt eval time = 2151.53 ms / 960 tokens ( 2.24 ms per token, 446.19 tokens per second) eval time = 2084.82 ms / 104 tokens ( 20.05 ms per token, 49.88 tokens per second) total time = 4236.35 ms / 1064 tokens draft acceptance rate = 0.94444 ( 68 accepted / 72 generated) 

What's especially important to me is privacy here. I can safely navigate private environments with it without worrying that I'm leaking something to Gemini or alike.

It might not be perfect, but thanks to the high speeds, it's very easy to guide the model in the right direction if it ever starts drifting away.

Can't wait to get my hands on a R9700, or even a couple of them. A higher quant and bigger context are both gonna make it even more usable. Just need to get a new UPS first because my current one already tripped once due to tensor parallelism while I was away, hence the powerlimits 😅

submitted by /u/ABLPHA
[link] [comments]

Discussion (0)

Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.

Sign in →

No comments yet. Sign in and be the first to say something.

More from r/LocalLLaMA