Hugging Face Daily Papers · June 18, 2026 · 5 min read

Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.




arxiv:2606.18717


Published on Jun 17 · Submitted by Tolga Şakar on Jun 18

Organization: Lonewolf Research & Development

Authors: Tolga Şakar

Abstract

A neural morpheme-boundary model for Turkish achieves lossless tokenization and morphology-aware embeddings with improved efficiency and performance over traditional subword methods.

Generated by Qwen/Qwen2.5-Coder-32B-Instruct

Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and, in the case of WordPiece and rule-based analyzers, failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers (the only ones valid for generation), Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs. ~0.32), and uses ~19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead, a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
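To make the Poisson-binomial step concrete, here is a minimal sketch of one way such a dynamic program could work. It is not the paper's implementation. The function names (soft_memberships, hard_segments), the example probabilities, the 0.5 threshold, and the convention that each probability is attached to the character that opens a new morpheme are all assumptions made for illustration. If boundaries are treated as independent Bernoulli draws, the number of boundaries seen up to character t follows a Poisson-binomial distribution, and that count is the index of the morpheme the character belongs to.

```python
def soft_memberships(boundary_probs):
    """Soft segment memberships from per-character boundary probabilities.

    Assumed convention: boundary_probs[t] is the probability that character t
    opens a new morpheme. boundary_probs[0] is ignored, because the first
    character always opens segment 0.
    Returns rows where rows[t][k] = P(character t belongs to segment k).
    """
    n = len(boundary_probs)
    if n == 0:
        return []
    rows = [[1.0] + [0.0] * (n - 1)]  # character 0 is in segment 0 with certainty
    for t in range(1, n):
        p = boundary_probs[t]
        prev = rows[-1]
        cur = [0.0] * n
        for k in range(n):
            stay = prev[k] * (1.0 - p)                    # no boundary: same segment
            advance = prev[k - 1] * p if k > 0 else 0.0   # boundary: next segment
            cur[k] = stay + advance
        rows.append(cur)
    return rows


def hard_segments(word, boundary_probs, threshold=0.5):
    """Exact segments: cut wherever the boundary probability exceeds threshold.

    Segments are slices of the untouched input string, so joining them
    always reproduces the input.
    """
    inner = [t for t in range(1, len(word)) if boundary_probs[t] > threshold]
    cuts = [0] + inner + [len(word)]
    return [word[a:b] for a, b in zip(cuts, cuts[1:])]


# Hypothetical probabilities for "evlerimizde" (ev + ler + imiz + de).
word = "evlerimizde"
probs = [0.0, 0.05, 0.9, 0.05, 0.05, 0.8, 0.05, 0.05, 0.05, 0.95, 0.05]

segments = hard_segments(word, probs)
print(segments)                    # ['ev', 'ler', 'imiz', 'de']
print("".join(segments) == word)   # True

memberships = soft_memberships(probs)
print([round(x, 3) for x in memberships[2][:3]])  # distribution for the 'l' of "ler"
```

Because every row of the membership matrix is a smooth function of the boundary probabilities, gradients can flow through it during training. The hard path only slices the original string and never normalizes it, which is why the round trip holds regardless of the probabilities.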




Get this paper in your agent:

hf papers read 2606.18717

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2606.18717 in a dataset README.md to link it from this page.

Spaces citing this paper 1

Collections including this paper 1
