
Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.


Llama benchmark results

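| model | size | params | backend | ngl | threads | type_k | type_v | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | pp512 | 977.40 ± 2.02 |
| qwen35moe 35B.A3B Q4_K - Medium | 20.81 GiB | 34.66 B | SYCL | 99 | 1 | q8_0 | q8_0 | 1 | tg128 | 70.54 ± 0.12 |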
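For anyone wanting to rerun a comparable benchmark, here is a rough sketch of a llama-bench invocation inferred from the table columns (ngl 99, 1 thread, q8_0 K/V cache, flash attention on, pp512 and tg128). The post does not show the exact command, so treat this as an assumption; the model path and the `bench_command` helper are hypothetical, and the SYCL backend is selected when llama.cpp is built, not by a flag here.

```python
import shlex


def bench_command(model_path, binary="llama-bench"):
    """Build a llama-bench argument list mirroring the table's columns.

    Assumption: these flags are inferred from the reported results,
    not copied from the original poster's command line.
    """
    return [
        binary,
        "-m", model_path,   # GGUF model file (hypothetical path below)
        "-ngl", "99",       # offload all layers to the GPU
        "-t", "1",          # one CPU thread
        "-ctk", "q8_0",     # K cache quantization
        "-ctv", "q8_0",     # V cache quantization
        "-fa", "1",         # flash attention enabled
        "-p", "512",        # prompt processing test (pp512)
        "-n", "128",        # token generation test (tg128)
    ]


if __name__ == "__main__":
    # Hypothetical model location; replace with your own file.
    cmd = bench_command("models/qwen-35b-a3b-q4_k_m.gguf")
    print(shlex.join(cmd))
```

Running the script only prints the command, so it can be inspected before being pasted into a shell or handed to `subprocess.run`.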

I've chucked all my notes into an LLM and created an article if you want to recreate the same setup.

I am currently using this with oh my pi, and it's very usable. I was able to create a well-designed poker game without it getting stuck in a loop, hanging, or crashing.

I've also tried Intel's vLLM before but couldn't get this kind of performance for a single request. I see there have been some updates, so I'll give it another shot when I have the time.

Would love to hear if anyone's running a similar setup with any optimizations I'm missing, or anything in there that's actually doing nothing? Always looking to squeeze out more.

Also massive thanks to the llama.cpp contributors and everyone working to make local inferencing viable. The fact that I can do this kind of inferencing locally is only possible because of the people building and maintaining this stuff.

submitted by /u/Atomynos_Atom
