Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM
Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable.
Autocomplete: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L
Agentic: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL
Why these models:
Qwen2.5 is still the best model for infill imo. I tried Gemma4 E4B and Qwen3.5 9B/4B and both produce weird suggestions.
This autocomplete model takes ~8GB of VRAM with the command below, and suggestions show up basically instantly.
Qwen3.6 35B-A3B is actually good at agentic coding at Q8 if you give it a good prompt. At Q4 it's not usable tbh and gets lost a lot, but at Q8 it can figure stuff out and actually finish its work correctly. If you don't have a lot of RAM for the MoE experts, try Q6_K, but lower quants have noticeable quality issues. You probably need 64GB total RAM minimum; I have 96GB, and with both models running plus a bunch of random stuff open (browser, IDE, Teams) I'm at 56GB used.
Because it has only 3B active params it's still fast, and with the experts offloaded to RAM (--cpu-moe) the GPU-resident part fits into the remaining ~8GB of VRAM.
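If you want to verify how the split actually lands on your machine, a couple of standard commands are enough; this is just a generic check, nothing specific to this setup:

```bash
# Per-process VRAM usage: both llama-server instances should show up here.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# Overall system RAM, including the expert weights that --cpu-moe keeps on the CPU side.
free -h
```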
Commands:
```bash
llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L \
  -ngl 99 --no-mmap --ctx-size 0 -ctk q8_0 -ctv q8_0 \
  -np 1 --temp 0.5 --top-p 0.95 --top-k 20 --min-p 0.0 --port 8081
```
Note: I actually have no idea which hyperparameters to use for Qwen2.5, maybe someone will enlighten me and I'll edit the post.
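If you want to sanity-check the autocomplete server before pointing an editor at it, llama-server exposes a fill-in-the-middle endpoint; something like this (the snippet is just a made-up example) should print a completion for the gap between prefix and suffix:

```bash
# Ask the autocomplete server (port 8081) to fill in the middle of a snippet.
# input_prefix / input_suffix frame the hole the model should fill.
curl -s http://localhost:8081/infill -d '{
  "input_prefix": "def fib(n):\n    ",
  "input_suffix": "\n\nprint(fib(10))",
  "n_predict": 64
}' | jq -r '.content'
```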
```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL \
  --no-mmap --no-mmproj -fitt 0 -ngl 99 --cpu-moe \
  -b 2048 -ub 2048 --jinja \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01
```
llama.cpp autofits the model and I get ~145k context with this command. You can use -ctv q8_0 -ctk q8_0 if you want more context.
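To smoke-test the agentic server, you can hit llama-server's OpenAI-compatible chat endpoint directly. It runs on the default port 8080 since the command above doesn't set --port; the model field is required by the API but llama-server ignores it:

```bash
# One-off chat completion against the agentic model via the OpenAI-compatible API.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Write a bash one-liner that counts lines of Python code in the current repo."}
    ],
    "temperature": 0.6,
    "top_p": 0.95
  }' | jq -r '.choices[0].message.content'
```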
35B-A3B speed with this setup:
| test   | t/s             |
| ------ | --------------- |
| pp4096 | 2093.93 ± 22.64 |
| tg128  | 35.29 ± 0.48    |