Hugging Face Daily Papers · June 2, 2026 · 5 min read

Adapting Multilingual Embedding Models to Turkish via Cross-Lingual Tokenizer Surgery and Offline Distillation

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.



arxiv:2605.29992

Published on May 28 · Submitted by Ali Bayram on Jun 2 · magibu

Authors: M. Ali Bayram, Banu Diri, Savaş Yıldırım

Abstract

AI-generated summary: A Turkish-focused sentence embedding model is developed through efficient adaptation techniques, achieving superior performance with reduced computational costs compared to larger teacher models.

Sentence embeddings are a foundational component for semantic search, clustering, classification, and retrieval-augmented generation. This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model that produces 768-dimensional L2-normalized vectors and supports an 8,192-token context window, far exceeding the 512-token limit of earlier BERT-based Turkish encoders. Instead of full pretraining, an efficient three-stage adaptation pipeline is introduced: (1) construct a Turkish-optimized multilingual tokenizer with a 131,072-token vocabulary by pruning redundant tokens from the teacher's vocabulary and incorporating multilingual tokens via frequency analysis on a 40-language corpus, (2) clone a teacher embedding model while preserving the transformer backbone weights and initializing a compatible embedding table for the new vocabulary via mean-composition token mapping, and (3) perform offline embedding distillation from precomputed teacher vectors using a cosine similarity objective over a balanced 40-language Wikipedia corpus. The resulting student model contains approximately 200M parameters and, because it avoids online teacher inference during training, trains in roughly four hours on a single GPU at a total cost of $5-20. Empirically, Pearson/Spearman correlations of 77.55%/77.45% are obtained on STSbTR, surpassing the 300M-parameter teacher model (73.84%/72.92%). On TR-MTEB (26 tasks), a mean score of 63.9% is achieved (7th out of 26 models), providing a competitive cost-quality trade-off with 33% fewer parameters than the teacher. To facilitate reproducibility and downstream use, all artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling.
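Stage (2) of the pipeline can be illustrated with a small sketch. The abstract names the technique (mean-composition token mapping) but does not spell out the procedure, so everything below is an assumption: the function `mean_compose`, the toy greedy tokenizer, the zero-vector fallback, and the two-dimensional vectors are all hypothetical. The idea shown is that a token kept from the old vocabulary copies its embedding row, while a new token is split with the old tokenizer and initialized to the mean of its pieces' rows.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_compose(
    new_vocab: List[str],
    old_vocab: Dict[str, int],
    old_table: List[Vector],
    old_tokenize: Callable[[str], List[str]],
) -> List[Vector]:
    """Build an embedding table for new_vocab from the old embedding table."""
    dim = len(old_table[0])
    new_table: List[Vector] = []
    for token in new_vocab:
        if token in old_vocab:
            # Token exists in the old vocabulary: copy its row unchanged.
            new_table.append(list(old_table[old_vocab[token]]))
            continue
        # Otherwise split it with the old tokenizer and average the pieces.
        pieces = [p for p in old_tokenize(token) if p in old_vocab]
        if not pieces:
            # Fallback for unmappable tokens (an assumption, not from the paper).
            new_table.append([0.0] * dim)
            continue
        rows = [old_table[old_vocab[p]] for p in pieces]
        new_table.append([sum(col) / len(rows) for col in zip(*rows)])
    return new_table


# Toy data: a three-token "old" vocabulary with 2-dimensional embeddings.
old_vocab = {"ev": 0, "ler": 1, "de": 2}
old_table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]


def greedy_tokenize(text: str) -> List[str]:
    """Hypothetical greedy longest-match tokenizer over old_vocab."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in old_vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            i += 1  # skip a character that no old token covers
    return pieces


new_table = mean_compose(
    ["ev", "evler", "evlerde"], old_vocab, old_table, greedy_tokenize
)
print(new_table)
```

In this toy run, "evler" is initialized to the mean of the rows for "ev" and "ler", and "evlerde" to the mean of all three rows. Because the transformer backbone is kept as-is, an initialization of this kind gives distillation a starting point that is already close to the teacher's input space.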

Project page: https://huggingface.co/spaces/magibu/embeddingmagibu-200m · GitHub: https://github.com/malibayram/embedding-trainer

Community

alibayram

Paper submitter about 3 hours ago

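This paper presents embeddingmagibu-200m, a Turkish-focused sentence embedding model built through cross-lingual tokenizer surgery and offline embedding distillation. Instead of expensive full pretraining, we adapt a multilingual embedding model by constructing a Turkish-optimized 131k vocabulary tokenizer, cloning the teacher architecture with a compatible embedding table, and distilling from precomputed teacher vectors.

The resulting 200M-parameter model supports an 8,192-token context window and achieves 77.55% Pearson / 77.45% Spearman on STSbTR, outperforming the 300M-parameter teacher model. On TR-MTEB, it reaches a 63.9% mean score, ranking 7th among 26 models while offering a strong cost-quality trade-off.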
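For readers unfamiliar with how STS scores such as the Pearson and Spearman figures above are typically computed, here is a minimal standard-library sketch. It assumes the usual setup in which a model's cosine similarity for each sentence pair is correlated with a human-annotated gold score; the numbers below are made up for illustration and are not STSbTR data, and the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: gold similarity labels and model cosine similarities.
gold_scores = [0.0, 1.0, 2.5, 4.0, 5.0]
predicted = [0.10, 0.20, 0.60, 0.70, 0.95]

pearson_r = pearson(predicted, gold_scores)
spearman_rho = spearman(predicted, gold_scores)
print(f"Pearson {100 * pearson_r:.2f}% / Spearman {100 * spearman_rho:.2f}%")
```

Spearman depends only on the ordering of the predictions, so a model can score well on it even if its similarity values are not linearly calibrated against the gold scale, which is why both numbers are usually reported together.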

All artifacts are released, including model weights, tokenizer files, precomputed embedding datasets, and open-source cloning and distillation tooling. The work is relevant for Turkish NLP, low-resource language adaptation, sentence embeddings, semantic search, RAG, tokenizer optimization, and efficient distillation.
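The offline distillation idea can be sketched in a few lines. The source states only that a cosine similarity objective is applied to precomputed teacher vectors; the specific form used here, the batch mean of 1 - cos(student, teacher), is a common choice and is an assumption, as are the function names and toy vectors. The point of the "offline" setup is that the teacher vectors are read from a cache, so no teacher forward pass is needed while the student trains.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_distill_loss(
    student_batch: Sequence[Sequence[float]],
    teacher_batch: Sequence[Sequence[float]],
) -> float:
    """Batch mean of 1 - cos(student, teacher); assumed form of the objective."""
    losses = []
    for s, t in zip(student_batch, teacher_batch):
        s_hat, t_hat = l2_normalize(s), l2_normalize(t)
        cos = sum(a * b for a, b in zip(s_hat, t_hat))
        losses.append(1.0 - cos)
    return sum(losses) / len(losses)


# Teacher vectors are computed once and cached; these are toy stand-ins.
teacher_cache = [[1.0, 0.0], [0.0, 2.0]]
# Student outputs for the same two texts (hypothetical values).
student_out = [[3.0, 0.0], [2.0, 0.0]]

loss = cosine_distill_loss(student_out, teacher_cache)
print(loss)
```

In the toy batch, the first student vector points in the same direction as its teacher vector (loss 0) and the second is orthogonal to it (loss 1), so the batch loss is their average. A real training loop would compute this with an autodiff framework and backpropagate into the student.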


Get this paper in your agent:

hf papers read 2605.29992

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 1

Datasets citing this paper 1

Spaces citing this paper 2

Collections including this paper 1
