r/LocalLLaMA · June 2, 2026 · 1 min read

Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Llama benchmark results

model	size	params	backend	ngl	threads	type_k	type_v	fa	test	t/s
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	1	q8_0	q8_0	1	pp512	977.40 ± 2.02
qwen35moe 35B.A3B Q4_K - Medium	20.81 GiB	34.66 B	SYCL	99	1	q8_0	q8_0	1	tg128	70.54 ± 0.12
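For anyone wanting to rerun the numbers above, here is a rough sketch of a llama-bench invocation that matches the table's columns (ngl, threads, type_k, type_v, fa, and the pp512/tg128 tests). The model path and the helper name build_llama_bench_cmd are placeholders made up for illustration, and this is not necessarily the exact command used for the post. The SYCL backend is picked at build time, so it has no flag here.

```python
import shlex


def build_llama_bench_cmd(model_path, binary="llama-bench"):
    """Return an argv list for llama-bench mirroring the table's columns.

    model_path and binary are assumptions; point them at your own files.
    """
    return [
        binary,
        "-m", model_path,  # GGUF model file (placeholder path)
        "-ngl", "99",      # ngl column: offload all layers to the GPU
        "-t", "1",         # threads column
        "-ctk", "q8_0",    # type_k column: KV cache key type
        "-ctv", "q8_0",    # type_v column: KV cache value type
        "-fa", "1",        # fa column: flash attention enabled
        "-p", "512",       # pp512 test: 512-token prompt processing
        "-n", "128",       # tg128 test: 128-token generation
    ]


# Hypothetical file name; substitute the quant you actually downloaded.
cmd = build_llama_bench_cmd("models/qwen-35b-a3b-q4_k_m.gguf")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to subprocess.run(cmd) once llama-bench is on your PATH. Depending on the llama.cpp version, the accepted values for -fa may differ, so check llama-bench --help first.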

I've chucked all my notes into an LLM and turned them into an article, in case you want to recreate the same setup.

I am currently using this with oh my pi and it's very usable. I was able to create a well-designed poker game without it going into a loop, hanging, or crashing.

I've also tried Intel's vLLM before but couldn't get this kind of performance out of it for a single request. I see there have been some updates, so I'll give it another shot when I have the time.

Would love to hear if anyone's running a similar setup and can spot any optimizations I'm missing, or anything in there that's actually doing nothing. Always looking to squeeze out more.

Also massive thanks to the llama.cpp contributors and everyone working to make local inferencing viable. The fact that I can do this kind of inferencing locally is only possible because of the people building and maintaining this stuff.

submitted by /u/Atomynos_Atom