r/LocalLLaMA

DeepSeek V4 Flash is amazing! (WIP llama.cpp PR #24162)


In case you're not aware already, the DeepSeek V4 series is finally getting support in llama.cpp with this PR!

The PR is at a very early stage right now, so only try it if you're consciously willing to experiment out of curiosity and accept severe stability/performance tradeoffs. It runs very slowly (5-6 tps), GPU and FA support need work, etc., but the output is already reliable enough to judge correctness.

This is my most anticipated model and I had some time to spare, so I ended up downloading the HF model for DS-V4-Flash and quantizing it myself using the PR (I made a custom 3-bit quant to mimic the full-sized model's tensor layout). And wow!

The model perfectly addresses the three crucial pillars for local inference IMO:

  • The model's intelligence is amazing for its size. First time a local model in this size range actually feels comparable to frontier models, and I'm not exaggerating.
  • It holds up a lot better under quantization since it's natively an FP4-FP8 hybrid. This is crucial for local deployment and is my primary problem with models like MiniMax M2.7, where I'm not happy even with UD-Q4_K_XL.
  • It's incredibly efficient with context window scaling. It consumes way less KV cache, even with no flash attention!
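
For anyone wondering why the KV cache point matters so much: here's a rough sketch of how KV cache size grows with context for a conventional (uncompressed) attention layout. The dimensions below are made-up placeholders, NOT DeepSeek V4 Flash's real config, and the formula deliberately ignores whatever compression the DeepSeek architecture applies. It's only meant to show the baseline that an efficient design improves on.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size for conventional, uncompressed attention."""
    # Factor of 2: one K tensor and one V tensor are stored per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical dimensions for illustration only (not any real model's config).
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32768)
print(f"{size / 2**30:.1f} GiB")  # prints "6.0 GiB" with an FP16 cache
```

The size is linear in `n_ctx`, so doubling the context doubles the cache. That's why a model that needs fewer bytes per token of context leaves a lot more memory for weights on a local box.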

The Qwen 3.5/3.6 series is also a huge hit in the local community since it addresses the three pillars above way better than its competitors. However, I feel the DeepSeek model has taken things up another level, and I predict it will easily dominate the 80-140GB model space for many more months to come.

Huge shoutout and thanks to fairydreaming for their relentless work on getting DSA implemented, and to am17an and pwilkin for taking this up! Really looking forward to this PR getting merged!

submitted by /u/Lowkey_LokiSN