You can run DeepSeek 4 Flash on a Mac (M3 Max, 96 GB)
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
I didn't know this was actually possible until today.

Using Antirez's specific engine (https://github.com/antirez/ds4#running-models-larger-than-ram) plus his specific ds4 GGUF, it literally just runs. You need to pass […] when running if you have <128 GB, I think. 64 GB and up seems reasonable.

I also passed: […] to raise the available Metal allocation. Optionally, you can then patch the repo itself to increase the cache safety value, which is .70, to try to push how many experts get loaded into VRAM.

I also built a simple menu bar .app daemon so I can just Spotlight > run the server. It took about 20 minutes.

Prefill / times:

About 11-13 tk/s on my M3 Max 96 GB. From cold boot, it's about 10 s in an empty Jan assistant chat. After that, ~3-5 s TTFT.

Unfortunately, larger prefill is frustrating, so I'm unsure if I want to try this with much coding. 36k tokens take about 2 minutes and 30 seconds. But once it's in cache, it sustains about 12 tk/s.

----

Anyways, maybe this was common knowledge, but I didn't think this was possible. It's not that much slower than Qwen 27B. I'm unsure how it benchmarks against it, but obviously it's much larger.
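For anyone weighing this for coding, the timings above work out to a rough prefill rate, which can be sketched in a few lines of Python. This is only back-of-the-envelope arithmetic on the numbers in the post (36k tokens in 2 minutes 30 seconds, about 12 tk/s generation). The function names are made up for illustration, and it assumes prefill time scales linearly with prompt length, ignoring cold-boot overhead and anything already in cache, which is a simplification.

```python
# Back-of-the-envelope numbers taken from the post; nothing here is measured by the script.
PREFILL_TOKENS = 36_000    # "36k tokens"
PREFILL_SECONDS = 150      # "about 2 minutes and 30 seconds"
DECODE_TOK_PER_S = 12      # "sustains about 12 tk/s"


def prefill_rate(tokens=PREFILL_TOKENS, seconds=PREFILL_SECONDS):
    """Implied prompt-processing speed in tokens per second."""
    return tokens / seconds


def estimate_seconds(prompt_tokens, output_tokens,
                     prefill_tok_per_s=None,
                     decode_tok_per_s=DECODE_TOK_PER_S):
    """Rough wall-clock estimate for one uncached request.

    Assumption: prefill scales linearly with prompt length and generation
    runs at a constant rate. Real behaviour will differ.
    """
    if prefill_tok_per_s is None:
        prefill_tok_per_s = prefill_rate()
    return prompt_tokens / prefill_tok_per_s + output_tokens / decode_tok_per_s


if __name__ == "__main__":
    print(f"implied prefill rate: {prefill_rate():.0f} tok/s")
    # A 36k-token prompt followed by a 600-token answer.
    print(f"36k prompt + 600 output: {estimate_seconds(36_000, 600):.0f} s")
```

Under those assumptions the prefill rate comes out to roughly 240 tok/s, so a fresh 36k-token context plus a 600-token reply would take a bit over three minutes, most of it spent before the first token appears.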