Turkish is agglutinative: meaning is carried<br>by morphemes, yet the subword tokenizers<br>that drive modern language models split words<br>by corpus statistics, fragmenting semantically<br>loaded suffixes and—in the case of WordPiece<br>and rule-based analyzers—failing to decode<br>their output back to the original text. This<br>paper presents Morpheus, a neural morpheme-<br>boundary model for Turkish that is at once a<br>lossless, morphology-aware tokenizer and a<br>word-embedding producer. A differentiable<br>Poisson–binomial dynamic program turns<br>per-character boundary probabilities into soft<br>morpheme memberships during training and<br>exact segments at inference, with no string<br>normalization, so decode(encode(w)) = w<br>holds by construction. Because the model is<br>neural, the same forward pass that tokenizes<br>also emits a structured word embedding.<br>Among reversible tokenizers—the only ones<br>valid for generation—Morpheus attains the<br>lowest bits-per-character (1.425), roughly dou-<br>bles the gold morphological alignment of the<br>subword family (MorphScore macro-F1 0.61<br>vs. ∼0.32), and uses ∼19% less GPU memory<br>than 64K-vocabulary subword tokenizers. As<br>an embedder, frozen Morpheus vectors lead on<br>lexical retrieval (root-family MAP 0.85) and<br>same-root verification (ROC-AUC 1.00), sur-<br>passing the multilingual retriever BGE-M3 and<br>BERTurk; on context- and inflection-dependent<br>tasks (NER, case/number probing) the heav-<br>ier contextual encoders remain ahead—a<br>trade-off we attribute to Morpheus’s root-<br>centric geometry.</p>\n","updatedAt":"2026-06-18T11:32:06.912Z","author":{"_id":"69edd712065ba33ca8ab73ce","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/69edd712065ba33ca8ab73ce/P8_Mps6RfNXNe3VLt_Iw7.jpeg","fullname":"Tolga Şakar","name":"dfavenfre-dev","type":"user","isPro":false,"isHf":false,"isHfAdmin":false,"isMod":false,"isUserFollowing":false}},"numEdits":0,"identifiedLanguage":{"language":"en","probability":0.8440236449241638},"editors":["dfavenfre-dev"],"editorAvatarUrls":["https://cdn-avatars.huggingface.co/v1/production/uploads/69edd712065ba33ca8ab73ce/P8_Mps6RfNXNe3VLt_Iw7.jpeg"],"reactions":[],"isReport":false}}],"primaryEmailConfirmed":false,"paper":{"id":"2606.18717","authors":[{"_id":"6a338c9859127a45e2c1c6b9","user":{"_id":"69edd712065ba33ca8ab73ce","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/69edd712065ba33ca8ab73ce/P8_Mps6RfNXNe3VLt_Iw7.jpeg","isPro":false,"fullname":"Tolga Şakar","user":"dfavenfre-dev","type":"user","name":"dfavenfre-dev"},"name":"Tolga Şakar","status":"claimed_verified","statusLastChangedAt":"2026-06-18T11:26:30.647Z","hidden":false}],"mediaUrls":["https://cdn-uploads.huggingface.co/production/uploads/69edd712065ba33ca8ab73ce/fSxkiSYKaDsZA3BbEpcBD.png","https://cdn-uploads.huggingface.co/production/uploads/69edd712065ba33ca8ab73ce/eFd1_szyr6OMMTvMNhQuE.png"],"publishedAt":"2026-06-17T00:00:00.000Z","submittedOnDailyAt":"2026-06-18T00:00:00.000Z","title":"Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish","submittedOnDailyBy":{"_id":"69edd712065ba33ca8ab73ce","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/69edd712065ba33ca8ab73ce/P8_Mps6RfNXNe3VLt_Iw7.jpeg","isPro":false,"fullname":"Tolga Şakar","user":"dfavenfre-dev","type":"user","name":"dfavenfre-dev"},"summary":"Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs.\\ {sim}0.32), and uses {sim}19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.","upvotes":1,"discussionId":"6a338c9859127a45e2c1c6ba","projectPage":"https://huggingface.co/lonewolflab/Morpheus-TR-50K","githubRepo":"https://github.com/lonewolf-rd/TurkishMorpheus","githubRepoAddedBy":"user","ai_summary":"A neural morpheme-boundary model for Turkish achieves lossless tokenization and morphology-aware embeddings with improved efficiency and performance over traditional subword methods.","ai_keywords":["morpheme-boundary model","subword tokenizers","WordPiece","rule-based analyzers","Poisson-binomial dynamic program","soft morpheme memberships","reversible tokenizers","bits-per-character","MorphScore","lexical retrieval","contextual encoders","NER","case/number probing"],"ai_summary_model":"Qwen/Qwen2.5-Coder-32B-Instruct","githubStars":2,"organization":{"_id":"69edd913e47072ad65a27e00","name":"lonewolflab","fullname":"Lonewolf Research & Development","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/69edd712065ba33ca8ab73ce/E9ttN8Npe7dZ9AZTIDS_5.png"}},"canReadDatabase":false,"canManagePapers":false,"canSubmit":false,"hasHfLevelAccess":false,"upvoted":false,"upvoters":[{"_id":"69edd712065ba33ca8ab73ce","avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/69edd712065ba33ca8ab73ce/P8_Mps6RfNXNe3VLt_Iw7.jpeg","isPro":false,"fullname":"Tolga Şakar","user":"dfavenfre-dev","type":"user"}],"acceptLanguages":["en"],"dailyPaperRank":0,"organization":{"_id":"69edd913e47072ad65a27e00","name":"lonewolflab","fullname":"Lonewolf Research & Development","avatar":"https://cdn-avatars.huggingface.co/v1/production/uploads/69edd712065ba33ca8ab73ce/E9ttN8Npe7dZ9AZTIDS_5.png"},"markdownContentUrl":"https://huggingface.co/buckets/huggingchat/papers-content/resolve/2606/2606.18717.md","query":{}}">
Morpheus: A Morphology-Aware Neural Tokenizer and Word Embedder for Turkish
Abstract
A neural morpheme-boundary model for Turkish achieves lossless tokenization and morphology-aware embeddings with improved efficiency and performance over traditional subword methods.
Turkish is agglutinative: meaning is carried by morphemes, yet the subword tokenizers that drive modern language models split words by corpus statistics, fragmenting semantically loaded suffixes and -- in the case of WordPiece and rule-based analyzers -- failing to decode their output back to the original text. This paper presents Morpheus, a neural morpheme-boundary model for Turkish that is at once a lossless, morphology-aware tokenizer and a word-embedding producer. A differentiable Poisson-binomial dynamic program turns per-character boundary probabilities into soft morpheme memberships during training and exact segments at inference, with no string normalization, so decode(encode(w)) = w holds by construction. Because the model is neural, the same forward pass that tokenizes also emits a structured word embedding. Among reversible tokenizers -- the only ones valid for generation -- Morpheus attains the lowest bits-per-character (1.425), roughly doubles the gold morphological alignment of the subword family (MorphScore macro-F1 0.61 vs.\ {sim}0.32), and uses {sim}19% less GPU memory than 64K-vocabulary subword tokenizers. As an embedder, frozen Morpheus vectors lead on lexical retrieval (root-family MAP 0.85) and same-root verification (ROC-AUC 1.00), surpassing the multilingual retriever BGE-M3 and BERTurk; on context- and inflection-dependent tasks (NER, case/number probing) the heavier contextual encoders remain ahead -- a trade-off we attribute to Morpheus's root-centric geometry. Code: https://github.com/lonewolf-rd/TurkishMorpheus; model: https://huggingface.co/lonewolflab/Morpheus-TR-50K; interactive demo: https://huggingface.co/spaces/lonewolflab/morpheus-tr-demo.
Community
Turkish is agglutinative: meaning is carried
by morphemes, yet the subword tokenizers
that drive modern language models split words
by corpus statistics, fragmenting semantically
loaded suffixes and—in the case of WordPiece
and rule-based analyzers—failing to decode
their output back to the original text. This
paper presents Morpheus, a neural morpheme-
boundary model for Turkish that is at once a
lossless, morphology-aware tokenizer and a
word-embedding producer. A differentiable
Poisson–binomial dynamic program turns
per-character boundary probabilities into soft
morpheme memberships during training and
exact segments at inference, with no string
normalization, so decode(encode(w)) = w
holds by construction. Because the model is
neural, the same forward pass that tokenizes
also emits a structured word embedding.
Among reversible tokenizers—the only ones
valid for generation—Morpheus attains the
lowest bits-per-character (1.425), roughly dou-
bles the gold morphological alignment of the
subword family (MorphScore macro-F1 0.61
vs. ∼0.32), and uses ∼19% less GPU memory
than 64K-vocabulary subword tokenizers. As
an embedder, frozen Morpheus vectors lead on
lexical retrieval (root-family MAP 0.85) and
same-root verification (ROC-AUC 1.00), sur-
passing the multilingual retriever BGE-M3 and
BERTurk; on context- and inflection-dependent
tasks (NER, case/number probing) the heav-
ier contextual encoders remain ahead—a
trade-off we attribute to Morpheus’s root-
centric geometry.
Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.
Tap or paste here to upload images
Cite arxiv.org/abs/2606.18717 in a dataset README.md to link it from this page.
Discussion (0)
Sign in to join the discussion. Free account, 30 seconds — email code or GitHub.
Sign in →No comments yet. Sign in and be the first to say something.