r/LocalLLaMA · June 23, 2026 · 2 min read

650+ Apache-2.0 biomedical NER/de-id models that run on-device in MLX. Same fp32 weights, identical outputs: the clinical NER models run 30-40x faster than PyTorch-CPU on a 3-year-old M3 Max. Repro inside.

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

Disclosure first: I maintain OpenMed, so read this with that bias. I'm posting the numbers with the full methodology and a runnable script so you can reproduce or tear it apart. I'm here for the next couple of hours to answer methodology questions.

What it is: an open-source clinical/biomedical NER project with 1,000+ models on Hugging Face and the openmed Python SDK, all Apache 2.0. These are extraction tools, not diagnostic tools: multi-entity biomedical NER (genes, chemicals, cancers, cells, organisms), disease NER, drug NER, and multilingual PII de-identification. No diagnosis, no clinical decision support.

What's new: 410 new MLX builds, bringing it to 650+ total. They run on macOS via MLX and on iPhone/iPad via OpenMedKit (open Swift package). The NER paper is arXiv 2508.01630 (SOTA across 12 public datasets, per-dataset tables inside, judge them yourself).

On-device speed, methodology first. Same model, MLX on Apple Silicon vs PyTorch on CPU, same fp32 precision, byte-identical entity outputs (parity-checked). On a 3-year-old MacBook Pro M3 Max, the clinical NER models run 30-40x faster on MLX: a 434M biomedical NER model takes 27 ms on MLX vs ~1080 ms on CPU at fp32, with the same weights and identical entities. The reason is architectural, not a precision trick: these are deberta-v2 models whose disentangled attention is O(n^2) and very slow on CPU, while the Apple GPU handles it easily. The speedup is input- and model-dependent, so a smaller model on short text gets a single-digit speedup, not 30x. The second clip in the video is the PII de-identification model redacting on-device; the point there is privacy: identifiers are stripped locally and nothing leaves the machine.
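To make the local redaction step concrete, here is a minimal sketch of turning detected spans into a redacted string. It is not the OpenMed implementation: the `redact` helper, the sample note, and the `NAME`/`ID`/`DATE` labels are all hypothetical, and it assumes the de-id model returns entities as dicts with `entity_group`, `start`, and `end` keys, the way Hugging Face token-classification pipelines do with `aggregation_strategy="simple"`.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder (hypothetical helper)."""
    out = text
    # Work right to left so earlier character offsets stay valid
    # after each replacement changes the string length.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[:ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out


# Hand-written spans standing in for model output; labels are assumptions.
note = "Jane Doe, MRN 12345, seen on 2024-01-05."
spans = [
    {"entity_group": "NAME", "start": 0, "end": 8},
    {"entity_group": "ID", "start": 14, "end": 19},
    {"entity_group": "DATE", "start": 29, "end": 39},
]
redacted = redact(note, spans)
print(redacted)
```

This sketch assumes the spans do not overlap; overlapping spans would need to be merged before replacing.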

434M biomedical NER: 36 ms MLX vs 1248 ms PyTorch-CPU-bf16
434M PII de-id: 46 ms MLX vs 1671 ms PyTorch-CPU-bf16
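For anyone checking the arithmetic behind the 30-40x claim, the ratios can be recomputed from the latencies quoted above. This is only a sketch over the numbers in this post; the dictionary keys are my own labels, and the ~1080 ms figure is approximate in the original.

```python
# Latencies in milliseconds as quoted in the post: (MLX, PyTorch CPU).
reported = {
    "434M biomedical NER, CPU fp32": (27, 1080),
    "434M biomedical NER, CPU bf16": (36, 1248),
    "434M PII de-id, CPU bf16": (46, 1671),
}


def speedup(mlx_ms: float, cpu_ms: float) -> float:
    """How many times faster MLX is than the CPU run."""
    return cpu_ms / mlx_ms


speedups = {name: speedup(*pair) for name, pair in reported.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

All three ratios land between roughly 35x and 40x, which is where the 30-40x range comes from.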

    import time, statistics, torch
    from openmed.core.backends import get_backend
    from openmed.core.config import OpenMedConfig
    from openmed.mlx.inference import _download_preconverted_mlx_model, create_mlx_pipeline

    MODEL = "OpenMed/OpenMed-NER-OncologyDetect-SuperClinical-434M"
    text = ("Metastatic non-small cell lung carcinoma. EGFR exon 19 deletion, KRAS G12C, "
            "wild-type TP53/BRAF. Cisplatin, pemetrexed, then osimertinib; sotorasib held. "
            "Xenografts in Mus musculus mirrored Homo sapiens organoids on carboplatin.")

    mlx = create_mlx_pipeline(_download_preconverted_mlx_model(MODEL + "-mlx"),
                              aggregation_strategy="simple")
    cpu = get_backend("hf", config=OpenMedConfig(device="cpu")).create_pipeline(
        MODEL, task="token-classification", aggregation_strategy="simple",
        torch_dtype=torch.float32)

    def med(p):
        p(text)  # warmup
        ts = []
        for _ in range(7):
            t0 = time.perf_counter()
            p(text)
            ts.append((time.perf_counter() - t0) * 1000)  # milliseconds
        return statistics.median(ts)

    print(f"MLX {med(mlx):.0f} ms | CPU fp32 {med(cpu):.0f} ms")
    # ~27 ms | ~1080 ms -> ~40x, identical entities
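The script above times both backends but does not itself compare their outputs. A parity check along the following lines could be added; this is a sketch, not the project's actual check. The helpers `entity_key` and `same_entities` are hypothetical, the `GENE` label is a placeholder, and the code assumes both pipelines return lists of dicts with `entity_group`, `start`, and `end` keys (the Hugging Face convention for `aggregation_strategy="simple"`). Scores are excluded from the comparison on the assumption that small floating-point differences between backends should not count as a mismatch.

```python
from typing import Dict, List, Tuple


def entity_key(ent: Dict) -> Tuple[str, int, int]:
    # Compare label and character span only; scores are ignored (assumption).
    return (ent["entity_group"], ent["start"], ent["end"])


def same_entities(a: List[Dict], b: List[Dict]) -> bool:
    """True when both outputs contain the same labeled spans."""
    return sorted(map(entity_key, a)) == sorted(map(entity_key, b))


# Stand-in outputs; in the benchmark these would be mlx(text) and cpu(text).
mlx_out = [{"entity_group": "GENE", "start": 42, "end": 46, "score": 0.9991}]
cpu_out = [{"entity_group": "GENE", "start": 42, "end": 46, "score": 0.9990}]
shifted = [{"entity_group": "GENE", "start": 42, "end": 45, "score": 0.9990}]

parity = same_entities(mlx_out, cpu_out)
mismatch = same_entities(mlx_out, shifted)
print(parity, mismatch)
```

For a stricter check, the `word` field or the scores within a tolerance could be added to the comparison key.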

iPhone note: I'm not claiming 36 ms is an iPhone figure; it was measured on the M3 Max. The phone story is "these run via OpenMedKit".

Everything's public: models (Apache 2.0 HF), SDK (Apache 2.0 GitHub), paper (arXiv 2508.01630). Ask me anything on the parity check, the dtype story, or the dataset numbers.

submitted by /u/dark-night-rises