650+ Apache-2.0 biomedical NER/de-id models that run on-device in MLX. Same fp32 weights, identical outputs: the clinical NER models run 30-40x faster than PyTorch-CPU on a 3-year-old M3 Max. Repro inside.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Disclosure first: I maintain OpenMed, so read this with that bias. I'm posting the numbers with the full methodology and a runnable script so you can reproduce or tear it apart. I'm here for the next couple of hours to answer methodology questions.

What it is: an open-source clinical/biomedical NER project. 1,000+ models on Hugging Face, all Apache 2.0, and the ...

What's new: 410 new MLX builds, bringing it to 650+ total. They run on macOS via MLX and on iPhone/iPad via OpenMedKit (open Swift package). The NER paper is arXiv 2508.01630 (SOTA across 12 public datasets, per-dataset tables inside, judge them yourself).

On-device speed, methodology first. Same model, MLX on Apple Silicon vs PyTorch on CPU, same fp32 precision, byte-identical entity outputs (parity-checked). On a 3-year-old MacBook Pro M3 Max, the clinical NER models run 30-40x faster on MLX: a 434M biomedical NER model takes 27 ms (MLX) vs ~1080 ms (CPU) at fp32, same weights, identical entities. The reason is architectural, not a precision trick: these are deberta-v2 models whose disentangled attention is O(n^2) and very slow on CPU, while the Apple GPU handles it easily. The speedup is input- and model-dependent, so a smaller model on short text is single-digit-x, not 30x.

The second clip in the video is the PII de-identification model redacting on-device. The point there is privacy: identifiers are stripped locally and nothing leaves the machine.
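For anyone who wants to sanity-check the speed claim before running the real script, here is a minimal sketch of the comparison: time two backends on the same input, check that the entity outputs match, then divide the latencies. It is not the OpenMed benchmark script. The names `median_latency_ms`, `entities_match`, `speedup`, and the two stand-in backends are hypothetical, and the entity format (start, end, label tuples) is an assumption. Swap the stand-ins for your own MLX and PyTorch-CPU inference callables.

```python
import statistics
import time


def median_latency_ms(run, text, warmup=3, runs=20):
    """Median wall-clock latency of run(text) in milliseconds.

    Note: MLX evaluates lazily, so an MLX callable should force evaluation
    (for example with mx.eval) before returning, or the timing is meaningless.
    """
    for _ in range(warmup):  # warm-up calls are not timed
        run(text)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run(text)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def entities_match(a, b):
    """Parity check: same entities, same order, same spans and labels."""
    return list(a) == list(b)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Hypothetical stand-ins for the two backends (assumption: each returns a
# list of (start, end, label) tuples). Replace with real inference calls.
def fake_cpu_backend(text):
    return [(0, 7, "DRUG")]


def fake_mlx_backend(text):
    return [(0, 7, "DRUG")]


if __name__ == "__main__":
    text = "aspirin 81 mg daily"
    assert entities_match(fake_cpu_backend(text), fake_mlx_backend(text))
    cpu_ms = median_latency_ms(fake_cpu_backend, text)
    mlx_ms = median_latency_ms(fake_mlx_backend, text)
    print(f"cpu {cpu_ms:.3f} ms, mlx {mlx_ms:.3f} ms")
    # Figures quoted in the post: ~1080 ms on CPU vs 27 ms on MLX.
    print(speedup(1080, 27))  # 40.0
```

The median is used rather than the mean so a single slow run (thermal throttling, a background process) does not skew the result. The parity check runs before timing because a speedup only means something if both backends produce the same entities.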
iPhone note: I'm not claiming 36 ms is a phone number, it's the M3 Max. The phone story is "these run via OpenMedKit". Everything's public: models (Apache 2.0 HF), SDK (Apache 2.0 GitHub), paper (arXiv 2508.01630). Ask me anything on the parity check, the dtype story, or the dataset numbers.