Qwen 3.6-35B-A3B with 977 tk/s prompt processing and 262k context window on Intel Arc B70 Pro
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
[Image: Llama benchmark results]
I've chucked all my notes into an LLM and created an article, in case you want to recreate the same setup. I'm currently using this with oh my pi and it's very usable: I was able to create a well-designed poker game without it going into a loop, hanging, or crashing.

I've also tried Intel's vLLM before but couldn't get this kind of performance out of it for a single request. I see that there have been some updates, so I'll give it another shot when I have the time.

I'd love to hear if anyone's running a similar setup with optimizations I'm missing, or if anything in there is actually doing nothing. Always looking to squeeze out more.

Also, massive thanks to the llama.cpp contributors and everyone working to make local inference viable. Being able to run this kind of inference locally is only possible because of the people building and maintaining this stuff.