r/LocalLLaMA

Ornith 35B works reasonably well with Qwen3.6 35B DFlash speculative model

I saw a solid 30-40% increase in token generation speed from this:

./llama-server --no-mmap --port 8080 --host 0.0.0.0 -kvu -ts 75,70 \
  --alias qwen -hf bartowski/deepreinforce-ai_Ornith-1.0-35B-GGUF:Q8_0 -sm layer -c 255000 -cram 0 \
  -ctk f16 -ctv f16 -fa 1 --jinja -t 7 --metrics --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0 \
  --presence_penalty 0.0 --repeat-penalty 1.0 --ctx-checkpoints 4 --checkpoint-min-step 1024 \
  --chat-template-kwargs '{"preserve_thinking": true}' \
  -hfd williamliao/Qwen3.6-35B-A3B-DFlash-GGUF:Q8_0 --spec-draft-n-max 4 --spec-type draft-dflash

I'm not completely sure it's the best DFlash match, but it's good enough: I got a solid 80% acceptance rate at 50k context of JavaScript code mixed in with random Wikipedia tests.
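For a rough feel of what an 80% acceptance rate means with `--spec-draft-n-max 4`, here is a back-of-the-envelope sketch. It uses the textbook speculative decoding estimate, which assumes each drafted token is accepted independently with the same probability. That assumption may not match how llama-server computes its reported acceptance rate or how a DFlash draft behaves. The function name `expected_tokens_per_pass` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate tokens emitted per target-model verification pass.
# Assumption: each drafted token is accepted independently with probability a,
# and n tokens are drafted per pass. Formula: (1 - a^(n+1)) / (1 - a).
expected_tokens_per_pass() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (a >= 1) {
      # Every draft token accepted: n drafted tokens plus one from the target model.
      printf "%.2f\n", n + 1
    } else {
      printf "%.2f\n", (1 - a^(n + 1)) / (1 - a)
    }
  }'
}

# Values from the post: 80% acceptance, at most 4 drafted tokens.
expected_tokens_per_pass 0.8 4
```

This only counts tokens per verification pass. It ignores the cost of running the draft model and the slower prompt processing, so it is an optimistic ceiling rather than a prediction of the measured speedup.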

As is common with speculative drafting, you gain speed in token generation but take a solid hit in prompt processing. So this is far from a silver bullet, but it might help some of you.

submitted by /u/hurdurdur7