Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I saw a solid 30-40% token gen increase from this:
./llama-server --no-mmap --port 8080 --host 0.0.0.0 -kvu -ts 75,70 \
  --alias qwen -hf bartowski/deepreinforce-ai_Ornith-1.0-35B-GGUF:Q8_0 -sm layer -c 255000 -cram 0 \
  -ctk f16 -ctv f16 -fa 1 --jinja -t 7 --metrics --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 \
  --presence_penalty 0.0 --repeat-penalty 1.0 --ctx-checkpoints 4 --checkpoint-min-step 1024 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  -hfd williamliao/Qwen3.6-35B-A3B-DFlash-GGUF:Q8_0 --spec-draft-n-max 4 --spec-type draft-dflash

Not completely sure if it's the best DFlash match, but it's good enough (I got a solid 80% acceptance rate at 50k context of JavaScript code mixed in with random Wikipedia tests).
As is common with speculative drafting, you gain speed in token generation but take a solid hit in prompt processing. So this is far from a silver bullet, but it might help some of you.
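Whether the trade-off pays off depends on how prompt-heavy your requests are. A rough way to check is to take the prompt-eval and eval tokens-per-second that llama-server logs for each request, with and without the draft model, and plug them into a quick estimate. The sketch below is only that: `total_seconds` is a hypothetical helper, and the throughput numbers are made-up placeholders, not measurements from this post.

```shell
#!/usr/bin/env bash
# Estimate end-to-end seconds for one request from measured throughput.
# usage: total_seconds PROMPT_TOKENS PP_TOK_PER_S GEN_TOKENS TG_TOK_PER_S
# (hypothetical helper, not part of llama.cpp)
total_seconds() {
  awk -v p="$1" -v pp="$2" -v g="$3" -v tg="$4" \
    'BEGIN { printf "%.2f\n", p / pp + g / tg }'
}

# Placeholder numbers, NOT measurements from the post.
# Replace them with the tokens/s from your own runs.
baseline=$(total_seconds 50000 1000 2000 40)    # no draft model
with_draft=$(total_seconds 50000 700 2000 54)   # slower prompt, faster generation

echo "baseline:   ${baseline}s"
echo "with draft: ${with_draft}s"
```

With long prompts and short replies the prompt-processing hit can outweigh the generation gain; with short prompts and long replies the draft model is more likely to come out ahead.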