Meddies PII: An Open Multilingual De-identification Model for Clinical Text
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
A clinical AI model does not need to know who the patient is to reason clinically.
It needs the symptoms, medications, lab results, diagnosis history, and treatment course.
The problem is that in real medical records, those facts usually sit next to identifiers: names, record IDs, insurance numbers, addresses, phone numbers, admission dates, department names.
So clinical de-identification has a double contract:
1. Do not let patient identifiers leak.
2. Do not destroy the clinical facts that still need to be used.
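To make the two requirements concrete, here is a minimal sketch that treats them as two separate checks on a redacted note. It is not how Meddies PII works internally: the regular expressions, the sample note, and the helper names `redact`, `leaked`, and `destroyed` are all assumptions made up for illustration.

```python
import re

# Hypothetical identifier patterns, for illustration only. Real records need
# far broader coverage than a few regular expressions can give.
IDENTIFIER_PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "RECORD_ID": re.compile(r"\bMRN-\d+\b"),
}


def redact(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in IDENTIFIER_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


def leaked(redacted: str, identifiers: list[str]) -> list[str]:
    """Check 1: identifiers that still appear verbatim after redaction."""
    return [item for item in identifiers if item in redacted]


def destroyed(redacted: str, clinical_facts: list[str]) -> list[str]:
    """Check 2: clinical facts that no longer appear after redaction."""
    return [fact for fact in clinical_facts if fact not in redacted]


# Invented sample note; every identifier in it is fictional.
note = (
    "MRN-48213, born 1961-04-09, phone 555-010-4477. "
    "Creatinine 86 µmol/L; metformin 500 mg twice daily."
)
identifiers = ["MRN-48213", "1961-04-09", "555-010-4477"]
clinical_facts = ["86 µmol/L", "metformin 500 mg"]

redacted = redact(note)
print(redacted)
print("leaked:", leaked(redacted, identifiers))
print("destroyed:", destroyed(redacted, clinical_facts))

# A redactor that masks every number hides the identifiers but also wipes
# out the lab value and the dose.
overzealous = re.sub(r"\d+", "[NUM]", note)
print("overzealous leaked:", leaked(overzealous, identifiers))
print("overzealous destroyed:", destroyed(overzealous, clinical_facts))
```

Keeping the checks separate makes it visible when a redactor passes the first and fails the second, as the number-masking variant does here.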
That second part is easy to underestimate.
If a model misses a date of birth, the privacy boundary fails. If it removes "creatinine 86 µmol/L" or "metformin 500 mg," the downstream clinical record loses meaning. Both are failures, but they have different consequences.
We built Meddies PII for this problem. It is an open research model and dataset for multilingual clinical de-identification. The dataset is synthetic and built with dynamic prompting: each generation varies the language, document type, document label, note length, text format, edge case, and identifier family.
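As a rough picture of what dynamic prompting over these axes could look like, the sketch below samples one value per axis and turns the result into a generation prompt. The axis names follow the post, but every value list, the prompt wording, and the function names `sample_config` and `build_prompt` are hypothetical; the post does not describe the actual generation pipeline.

```python
import random

# Axis names come from the post. The values under each axis are hypothetical
# placeholders; the real value sets are not given in the post.
AXES = {
    "language": ["English", "French", "Spanish"],
    "document_type": ["discharge summary", "nursing form", "lab report"],
    "document_label": ["clinical", "administrative"],
    "note_length": ["short", "medium", "long"],
    "text_format": ["free text", "JSON", "XML", "chat"],
    "edge_case": ["none", "abbreviated names", "identifier inside a table"],
    "identifier_family": ["names", "record IDs", "contact details", "dates"],
}


def sample_config(rng: random.Random) -> dict[str, str]:
    """Draw one value per axis so each generation gets a different mix."""
    return {axis: rng.choice(values) for axis, values in AXES.items()}


def build_prompt(config: dict[str, str]) -> str:
    """Render a sampled configuration as a plain-text generation prompt."""
    lines = ["Write one synthetic clinical document with these properties:"]
    lines += [
        f"- {axis.replace('_', ' ')}: {value}" for axis, value in config.items()
    ]
    lines.append("Use only invented patient details.")
    return "\n".join(lines)


# A seeded generator keeps the sampled configurations reproducible.
rng = random.Random(0)
configs = [sample_config(rng) for _ in range(3)]
print(build_prompt(configs[0]))
```

Sampling each axis independently is the simplest choice; a real pipeline might instead weight rare combinations or exclude ones that make no sense together.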
The goal is not one pretty template. The goal is stable extraction behavior across the messy surfaces hospital data actually appears in: rushed notes, nursing forms, JSON/XML exports, multilingual text, administrative records, and chat-style prompts.
Meddies PII is not a complete de-identification product. Hospitals still need policy, audit logs, local validation, human escalation paths, and deployment controls.
But we think this is a useful starting point: open enough to inspect, careful enough to discuss honestly, and built from the reality that clinical AI needs more than benchmark performance to be deployable.
Full post: https://meddies.ai/research/meddies-pii
Demo: https://huggingface.co/spaces/Meddies/meddies-pii-extractor
Model: https://huggingface.co/Meddies/meddies-pii
Dataset: https://huggingface.co/datasets/Meddies/meddies-pii