Local LLM autocomplete + agentic coding on a single 16GB GPU + 64GB RAM
Today I set up a full coding toolbox on a single RTX 5080 (with RAM offloading) that's actually viable.
Autocomplete: bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L
Agentic: unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL
Why these models:
Qwen2.5 is still the best model for infill imo. I tried Gemma4 E4B and Qwen3.5 9B/4B and both produce weird suggestions.
This autocomplete model takes ~8GB of VRAM with the command below, and suggestions show up basically instantly.
Qwen3.6 35B-A3B is actually good at agentic coding at Q8 if you give it a good prompt. At Q4 it's not usable tbh and gets lost a lot, but at Q8 it can figure stuff out and actually finish its work correctly. If you don't have a lot of RAM for the MoE experts, try Q6_K, but lower quants have noticeable quality issues. You probably need 64GB total RAM minimum; I have 96GB, and with both models running plus a bunch of random stuff open (browser, IDE, Teams) I'm at 56GB used.
Because it has only 3B active params it's still fast, and with the experts offloaded to RAM (--cpu-moe) the GPU-resident part fits into the remaining ~8GB of VRAM.
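If you want to verify how the split actually lands on your machine, a couple of standard commands are enough; this is just a generic check, nothing specific to this setup:

```bash
# Per-process VRAM usage: both llama-server instances should show up here.
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
# Overall system RAM, including the expert weights that --cpu-moe keeps on the CPU side.
free -h
```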
Commands:
```bash
llama-server -hf bartowski/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K_L \
  -ngl 99 --no-mmap --ctx-size 0 -ctk q8_0 -ctv q8_0 \
  -np 1 --temp 0.5 --top-p 0.95 --top-k 20 --min-p 0.0 --port 8081
```
Note: I actually have no idea which hyperparameters to use for Qwen2.5, maybe someone will enlighten me and I'll edit the post.
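If you want to sanity-check the autocomplete server before pointing an editor at it, llama-server exposes a fill-in-the-middle endpoint; something like this (the snippet is just a made-up example) should print a completion for the gap between prefix and suffix:

```bash
# Ask the autocomplete server (port 8081) to fill in the middle of a snippet.
# input_prefix / input_suffix frame the hole the model should fill.
curl -s http://localhost:8081/infill -d '{
  "input_prefix": "def fib(n):\n    ",
  "input_suffix": "\n\nprint(fib(10))",
  "n_predict": 64
}' | jq -r '.content'
```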
```bash
llama-server -hf unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q8_K_XL \
  --no-mmap --no-mmproj -fitt 0 -ngl 99 --cpu-moe \
  -b 2048 -ub 2048 --jinja \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.01
```
llama.cpp autofits the model and I get ~145k context with this command. You can use -ctv q8_0 -ctk q8_0 if you want more context.
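To smoke-test the agentic server, you can hit llama-server's OpenAI-compatible chat endpoint directly. It runs on the default port 8080 since the command above doesn't set --port; the model field is required by the API but llama-server ignores it:

```bash
# One-off chat completion against the agentic model via the OpenAI-compatible API.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [
      {"role": "user", "content": "Write a bash one-liner that counts lines of Python code in the current repo."}
    ],
    "temperature": 0.6,
    "top_p": 0.95
  }' | jq -r '.choices[0].message.content'
```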
35B-A3B speed with this setup:
| test   | t/s             |
| ------ | --------------- |
| pp4096 | 2093.93 ± 22.64 |
| tg128  | 35.29 ± 0.48    |