r/LocalLLaMA

Inferencing at 10.33 t/s on Qwen 3.5 35B on a $300 laptop

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

https://preview.redd.it/u8062juegq3h1.png?width=1919&format=png&auto=webp&s=a213f6929c6cad58e92bc1681dac9f0545b04d13

Overview:

As consumer computing parts become scarcer due to the AI boom, finding ways to use lower-end hardware for less demanding AI applications can be highly beneficial. This is an ongoing project of mine to push the limits of a standard laptop on pure CPU/RAM inference under highly favorable conditions.

Hardware:

- Lenovo Ideapad Slim 3i 2023 (Best Buy, ~$300 at time of purchase)

- 12th Gen Intel© Core™ i3-1215U × 6

- 8 GB RAM, soldered on (Flex mode)

- 32 GB DDR4 laptop RAM expansion

- Linux Mint

Model:

- Qwen 3.5 heretic tune MTP at Q4_K_S

Link : https://huggingface.co/llmfan46/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved

Inference Backend:

ik_llama.cpp - version 4509 (40aae0b6)

built with cc (Ubuntu 13.3.0-6ubuntu2~24.04.1) 13.3.0 for x86_64-linux-gnu

Sampler Parameters (From Qwen 3.5 model card for general tasks, thinking):

Temperature: 1.0

top_p: 0.95

top_k: 20

min_p: 0.0

presence_penalty: 1.5

repetition_penalty: 1.0

Optimizations:

- Bios -> Battery -> Extreme performance mode

- Bios -> Quiet mode for fan (off)

- Latest ik_llama.cpp build (for better cpu performance)

- In-OS battery mode set to performance

- Fresh system restart

- Laptop set on cool flat surface

- Core pinning (performance cores only): cores 0 and 2

- Q4_K_S quantization, 35B MoE, with only 3B active params

- Batch size 64 (Tests did not show a massive difference, but more testing is needed. It doesn't seem to hurt.)

- Speculative Decoding Type MTP

- Draft Max 3

- Quantize K and V cache to Q8_0

- Flash Attention (suggested by Claude, but I found it was already enabled by default)

- fmoe, fused MoE (suggested by Claude, but I found it was already enabled by default)

- rtr, run-time repack (suggested by Claude, but I found it was already enabled by default)
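To give a feel for what the "Draft Max 3" setting can buy, here is a minimal sketch of the usual back-of-the-envelope model for speculative decoding. It assumes each drafted token is accepted independently with the same probability, which is a simplification, and the acceptance rates printed below are hypothetical: the post does not report a measured acceptance rate.

```python
def expected_tokens_per_pass(accept_rate: float, draft_max: int) -> float:
    """Expected tokens emitted per verification pass of the main model.

    Assumption (not from the post): every drafted token is accepted
    independently with probability `accept_rate`, and everything after the
    first rejection is discarded. The main model always contributes one
    token per pass, so the expectation is 1 + p + p^2 + ... + p^draft_max.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    return sum(accept_rate ** i for i in range(draft_max + 1))


if __name__ == "__main__":
    # Hypothetical acceptance rates, for illustration only.
    for p in (0.0, 0.5, 0.8, 1.0):
        tokens = expected_tokens_per_pass(p, draft_max=3)
        print(f"accept_rate={p:.1f} -> {tokens:.3f} tokens per pass")
```

With a draft length of 3, the upper bound is 4 tokens per pass. The actual speedup is smaller than this ratio because drafting and verifying the extra tokens also costs time.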

Testing Setup:

To properly test this setup, the OS was fully restarted, and ik_llama.cpp was launched with this command:

taskset -c 0,2 ./build/bin/llama-cli \
  -m "/home/default/LLM Models/Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_S.gguf" \
  -p "User: Please explain the history of france \nAI:" \
  -n 1028 \
  --spec-type mtp \
  --draft-max 3 \
  -t 2 \
  -ub 64 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 1.5 \
  --repeat-penalty 1.0
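The `taskset -c 0,2` prefix pins the process to one logical CPU on each performance core. On a different hybrid CPU the right numbers may differ, so here is a rough sketch of how one might derive them from Linux sysfs. The heuristic (treat the highest-clocked cores as performance cores) and the function names are my own assumptions, not something ik_llama.cpp provides, and the sample layout in the comment is hypothetical.

```python
from pathlib import Path


def read_topology(sysfs_root: str = "/sys/devices/system/cpu") -> dict[int, tuple[int, int]]:
    """Map logical CPU number -> (core_id, max_freq_khz) using Linux sysfs."""
    topo = {}
    for cpu_dir in Path(sysfs_root).glob("cpu[0-9]*"):
        try:
            core_id = int((cpu_dir / "topology" / "core_id").read_text())
            max_khz = int((cpu_dir / "cpufreq" / "cpuinfo_max_freq").read_text())
        except (OSError, ValueError):
            continue  # offline CPU, or no cpufreq information exposed
        topo[int(cpu_dir.name[3:])] = (core_id, max_khz)
    return topo


def pick_p_core_cpus(topo: dict[int, tuple[int, int]]) -> list[int]:
    """Pick one logical CPU per physical core among the highest-clocked cores.

    Heuristic assumption: performance cores report the highest max frequency.
    Skipping hyperthread siblings leaves one thread per physical core.
    """
    if not topo:
        return []
    top_khz = max(khz for _, khz in topo.values())
    chosen: dict[int, int] = {}
    for cpu in sorted(topo):
        core_id, khz = topo[cpu]
        if khz == top_khz and core_id not in chosen:
            chosen[core_id] = cpu
    return sorted(chosen.values())


if __name__ == "__main__":
    # A hypothetical 2P+4E layout would look like:
    # {0: (0, 4400000), 1: (0, 4400000), 2: (4, 4400000), 3: (4, 4400000),
    #  4: (8, 3300000), 5: (9, 3300000), 6: (10, 3300000), 7: (11, 3300000)}
    cpus = pick_p_core_cpus(read_topology())
    if cpus:
        print("taskset -c " + ",".join(map(str, cpus)))
    else:
        print("No CPU topology found (not Linux, or cpufreq unavailable).")
```

The printed list can be passed to `taskset -c`, with `-t` set to the number of CPUs in it.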

Results (On a sample of 1028 tokens)

Prompt Eval: 22.49 t/s

Inference Speed: 10.33 t/s

Observations:

The model ran much faster than other models of similar size, possibly due to architectural choices made for the Qwen 3.5 line, particularly the 35B. Testing similar settings with Gemma 4 26b a4b at ~Q4 yielded much slower results, around 3 t/s, despite it having only about a third more active parameters (nominally 4B vs. 3B).
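To put the comparison in numbers, here is a quick sketch using only the figures quoted in this post. It takes the model names at face value (3B and 4B active parameters) and assumes, as a crude first-order model, that CPU decode speed scales inversely with active parameters at the same quantization. That assumption ignores MTP drafting and architectural differences, which is exactly the gap being illustrated.

```python
# Figures from the post; active-parameter counts are read off the model
# names ("A3B", "a4b") and are therefore nominal.
QWEN_ACTIVE_B, QWEN_TPS = 3.0, 10.33
GEMMA_ACTIVE_B, GEMMA_TPS = 4.0, 3.0


def percent_more(a: float, b: float) -> float:
    """How much larger a is than b, in percent."""
    return (a / b - 1.0) * 100.0


def naive_scaled_tps(ref_tps: float, ref_active_b: float, active_b: float) -> float:
    """Crude assumption: tokens/s is inversely proportional to active params."""
    return ref_tps * ref_active_b / active_b


if __name__ == "__main__":
    extra = percent_more(GEMMA_ACTIVE_B, QWEN_ACTIVE_B)
    naive = naive_scaled_tps(QWEN_TPS, QWEN_ACTIVE_B, GEMMA_ACTIVE_B)
    print(f"Gemma has {extra:.1f}% more active parameters")
    print(f"Naive expectation for Gemma: {naive:.2f} t/s, observed ~{GEMMA_TPS:.0f} t/s")
    print(f"Qwen is {QWEN_TPS / GEMMA_TPS:.2f}x faster in practice")
```

Under that naive scaling Gemma would land near 7.7 t/s, so the observed gap of roughly 3.4x is far larger than the parameter counts alone would suggest.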

During generation, temperatures hovered just under the limit, at 90°C. Previously, when using llama.cpp, all cores were capped at 17.5 W to avoid overheating and subsequent throttling, but I found that no wattage cap was needed with ik_llama.cpp. This may be due to ik_llama.cpp having better CPU efficiency, though it may also be attributable to an unseen external variable.

Potential Future Optimizations:

- Manual Configuration of XMP Memory Timings, which requires the flashing of a custom BIOS. (Possibly +10% inference t/s)

- Thermal Repasting with higher-end paste to better control thermals.

- Switching from DDR4 laptop RAM to DDR5. (Combined with the thermal paste upgrade, potentially a rough gain of +20% inference t/s.)
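For reference, this is what those rough percentages would mean in tokens per second if they held. The baseline is the measured 10.33 t/s; the gains are the speculative estimates listed above, and the stacked figure additionally assumes the two gains multiply independently, which is untested.

```python
BASELINE_TPS = 10.33  # measured result from this post


def projected_tps(baseline: float, gain_pct: float) -> float:
    """Apply a percentage gain to a baseline tokens/s figure."""
    return baseline * (1.0 + gain_pct / 100.0)


if __name__ == "__main__":
    xmp = projected_tps(BASELINE_TPS, 10)           # XMP timings estimate
    ddr5_paste = projected_tps(BASELINE_TPS, 20)    # DDR5 + repaste estimate
    stacked = projected_tps(xmp, 20)                # assumes gains multiply
    print(f"XMP timings only:   {xmp:.2f} t/s")
    print(f"DDR5 + repaste:     {ddr5_paste:.2f} t/s")
    print(f"Both, if they stack: {stacked:.2f} t/s")
```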

submitted by /u/OcelotOk8071