Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation
Authors: M. Ali Bayram, Banu Diri, Savaş Yıldırım
Published: 2026-05-28
Project page: https://huggingface.co/spaces/magibu/embeddingmagibu-200m
Code: https://github.com/malibayram/embedding-trainer
Abstract
AI-generated summary: A Turkish-focused sentence embedding model is built by adapting a larger multilingual teacher model instead of pretraining from scratch, and it outperforms that teacher on STSbTR at lower computational cost.
Sentence embeddings are a foundational component of semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus; (2) clone the teacher embedding model, preserving the transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping; and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and, because it avoids online teacher inference during training, trains in roughly four hours on a single GPU at a total cost of $5-20. Empirically, it obtains Pearson/Spearman correlations of 77.55%/77.45% on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), it achieves a mean score of 63.9% (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
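The mean-composition token mapping in stage (2) can be illustrated with a small sketch. This is one plausible reading of the idea, not the authors' released tooling: each token in the new vocabulary is split into teacher tokens, and its embedding row is initialized to the mean of the corresponding teacher rows. The function names, the toy greedy tokenizer, the two-dimensional vectors, and the zero-row fallback for unmappable tokens are all assumptions made for illustration.

```python
import numpy as np

def init_student_embeddings(new_vocab, teacher_tokenize, teacher_emb):
    """Initialize one row per new token as the mean of its teacher pieces.

    new_vocab:        list of token strings in the new vocabulary
    teacher_tokenize: callable mapping a string to a list of teacher token ids
                      (hypothetical stand-in for the teacher tokenizer)
    teacher_emb:      array of shape (teacher_vocab_size, dim)
    """
    dim = teacher_emb.shape[1]
    student_emb = np.zeros((len(new_vocab), dim), dtype=teacher_emb.dtype)
    for new_id, token in enumerate(new_vocab):
        teacher_ids = teacher_tokenize(token)
        if teacher_ids:
            # Average the teacher rows that compose this new token.
            student_emb[new_id] = teacher_emb[teacher_ids].mean(axis=0)
        # Assumption: tokens with no teacher decomposition keep a zero row.
    return student_emb

# Toy teacher vocabulary and embedding table (illustrative values only).
teacher_vocab = {"ev": 0, "ler": 1, "de": 2}
teacher_emb = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])

def toy_teacher_tokenize(token):
    """Greedy longest-match split; returns [] if some piece is unknown."""
    ids, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):
            piece = token[i:j]
            if piece in teacher_vocab:
                ids.append(teacher_vocab[piece])
                i = j
                break
        else:
            return []
    return ids

new_vocab = ["ev", "evlerde"]  # "evlerde" = "ev" + "ler" + "de"
student_emb = init_student_embeddings(new_vocab, toy_teacher_tokenize, teacher_emb)
```

In this toy setup, a token shared by both vocabularies ("ev") simply copies its teacher row, while a longer Turkish word form ("evlerde") starts from the average of its three teacher pieces, so the cloned model begins training close to the teacher's embedding space.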
Community
This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built through cross-lingual tokenizer surgery and offline embedding distillation. Instead of expensive full pretraining, we adapt a multilingual embedding model by constructing a Turkish-optimized tokenizer with a 131k-token vocabulary, cloning the teacher architecture with a compatible embedding table, and distilling from precomputed teacher vectors.
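As a rough illustration of distilling from precomputed teacher vectors, the sketch below computes a cosine-based loss between student outputs and cached teacher embeddings. The source states only that a cosine similarity objective is used; the specific form (mean of 1 minus cosine similarity), the function names, and the toy arrays are assumptions. The point of the offline setup is that the teacher vectors are read from storage, so the teacher model never runs during training.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length, guarding against zero-norm rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def cosine_distillation_loss(student_vecs, teacher_vecs):
    """Mean of (1 - cosine similarity) over a batch of sentence vectors.

    Assumed loss form: 0 when student and teacher directions match,
    1 when they are orthogonal, 2 when they point in opposite directions.
    """
    s = l2_normalize(student_vecs)
    t = l2_normalize(teacher_vecs)
    cos = np.sum(s * t, axis=1)
    return float(np.mean(1.0 - cos))

# Stand-in for a batch of teacher embeddings loaded from a precomputed
# dataset (illustrative values, not real model outputs).
cached_teacher = np.array([[1.0, 0.0], [0.0, 2.0]])

# Stand-in for the student's outputs on the same two sentences.
student_out = np.array([[3.0, 0.0], [1.0, 0.0]])

loss = cosine_distillation_loss(student_out, cached_teacher)
```

Because the loss depends only on direction, vector magnitudes do not matter, which fits a model whose outputs are L2-normalized at inference time.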
The resulting 200M-parameter model supports an 8,192-token context window and achieves 77.55% Pearson / 77.45% Spearman on STSbTR, outperforming the 300M-parameter teacher model. On TR-MTEB, it reaches a 63.9% mean score, ranking 7th among 26 models while offering a strong cost-quality trade-off.
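For readers unfamiliar with how Pearson and Spearman figures are typically obtained on a semantic textual similarity benchmark, the following sketch scores each sentence pair by the cosine similarity of its two embeddings and correlates those scores with gold labels. It is a generic illustration with made-up toy vectors and labels, not the evaluation code behind the reported numbers; all function names are hypothetical.

```python
import numpy as np

def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of a and b."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / den

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy embeddings for three sentence pairs (illustrative values only).
emb_a = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
emb_b = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
gold = np.array([0.0, 2.5, 5.0])  # toy human similarity labels

pred = rowwise_cosine(emb_a, emb_b)
pearson_score = pearson(pred, gold)
spearman_score = spearman(pred, gold)
```

Pearson measures how linearly the predicted similarities track the gold labels, while Spearman only checks that the ordering of pairs agrees, which is why the two figures can differ slightly for the same model.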
All artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling. The work is relevant for Turkish NLP, low-resource language adaptation, sentence embeddings, semantic search, RAG, tokenizer optimization, and efficient distillation.