Hugging Face · 17 min read

Granite Embedding Multilingual R2: Open Apache 2.0 Multilingual Embeddings with 32K Context — Best Sub-100M Retrieval Quality

Enterprise Article Published May 14, 2026

TL;DR: Two new Apache 2.0 multilingual embedding models built on ModernBERT — a 97M-parameter compact model that beats every open sub-100M multilingual embedder on MTEB Multilingual Retrieval (60.3), and a 311M full-size model that scores 65.2 on MTEB Multilingual Retrieval (#2 among open models under 500M parameters) with Matryoshka support. Both cover 200+ languages, are tuned on 52 languages, handle 32K-token context (64x R1), and add code retrieval across 9 programming languages.

Granite Embedding Multilingual R2

In this post: Enterprise-Ready by Design · A Strong Sub-100M Multilingual Model · What Changed from R1 · Training the Full-Size 311M Model · Building the Compact 97M Multilingual Model · Benchmark Results · Matryoshka Embeddings · Deployment Options · For Framework Integrators · Which Model Should You Use? · Try The Models

Multilingual embedding models face a persistent tension: broad language coverage usually comes at the cost of model size, and small models usually sacrifice languages. If you work across languages — retrieval-augmented generation over multilingual corpora, cross-lingual search, code retrieval in international teams — you've likely had to choose between a model that's fast enough and one that's good enough.

The Granite Embedding Multilingual R2 release narrows that gap considerably. We're releasing two new multilingual embedding models:

  • granite-embedding-97m-multilingual-r2: a 97M-parameter compact model producing 384-dimensional embeddings
  • granite-embedding-311m-multilingual-r2: a 311M-parameter full-size model producing 768-dimensional embeddings with Matryoshka support

Both models support 200+ languages with enhanced retrieval quality for 52 languages and programming code, handle context lengths up to 32,768 tokens (a 64x increase over their R1 predecessors), and are released under the Apache 2.0 license. They work out of the box with sentence-transformers and transformers, require no task-specific instructions, and are compatible as drop-in replacements in LangChain, LlamaIndex, Haystack, and Milvus with a one-line model name change. For frameworks currently using an English-only default, that one line gives every user in your community support for 200+ languages — no API changes, no new dependencies, no code changes required on their end. Both models ship with ONNX and OpenVINO weights for CPU-optimized inference.

52 enhanced-support languages

The underlying encoder was pretrained on text from 200+ languages, producing general-purpose embeddings for any of them. The following 52 languages receive explicit retrieval-pair and cross-lingual training for higher-quality retrieval:

Albanian (sq), Arabic (ar), Azerbaijani (az), Bengali (bn), Bulgarian (bg), Catalan (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Estonian (et), Finnish (fi), French (fr), Georgian (ka), German (de), Greek (el), Hebrew (he), Hindi (hi), Hungarian (hu), Icelandic (is), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Khmer (km), Korean (ko), Latvian (lv), Lithuanian (lt), Malay (ms), Marathi (mr), Norwegian (no), Persian (fa), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovenian (sl), Spanish (es), Swahili (sw), Swedish (sv), Tagalog (tl), Telugu (te), Thai (th), Turkish (tr), Ukrainian (uk), Urdu (ur), Uzbek (uz), Vietnamese (vi).

Additionally, the models are trained on programming code (Python, Go, Java, JavaScript, PHP, Ruby, SQL, C, C++) and support cross-lingual code retrieval.
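
A quick, self-contained illustration of natural-language-to-code retrieval (the query and snippets below are invented for this example, not taken from the training data):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

query = "function that reverses a string"
snippets = [
    "def reverse_string(s):\n    return s[::-1]",            # Python
    "func Sum(a, b int) int {\n\treturn a + b\n}",           # Go
    "SELECT name FROM employees WHERE salary > 100000;",     # SQL
]

# Embed the natural-language query and the code snippets in the same space
scores = util.cos_sim(model.encode([query]), model.encode(snippets))
print(scores)  # the Python snippet should score highest for this query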

Enterprise-Ready by Design

Both embedding models are trained on a mixture of IBM-curated datasets, publicly available data, and internally generated or synthetic data. Public web-derived data used in training is selected and filtered using IBM-developed quality, deduplication, and governance processes intended to reduce risk in downstream commercial use. We intentionally avoid the MS-MARCO training dataset and datasets with explicit non-commercial licensing restrictions. The models are pretrained on GneissWeb, an IBM-curated dataset derived from publicly available web content and processed with IBM's data preparation and governance tooling, along with additional IBM-curated and other publicly available sources. Datasets undergo IBM governance review to assess licensing considerations, ownership signals, and personal data risks. These processes are designed to support responsible use and enterprise deployment.

A Strong Sub-100M Multilingual Model

The standout of this release is granite-embedding-97m-multilingual-r2. At 97 million parameters, it scores 60.3 on MTEB Multilingual Retrieval across 18 languages — the highest retrieval score we've found for any open multilingual embedding model under 100M parameters. The next-best model in that size class, multilingual-e5-small, scores 50.9 on the same benchmark — a +9.4 point gap on a mature benchmark.

At roughly one-third the size of the 311M full-size model, it retains the majority of its retrieval quality across multilingual, code, and long-document benchmarks — a +12.2 point gain on MTEB Multilingual Retrieval over its direct predecessor, driven by a new architecture, better training data, and a novel pruning methodology (more on that below). The full-size granite-embedding-311m-multilingual-r2 scores 65.2 on the same benchmark, a +13.0 point gain over its R1 predecessor.

What Changed from R1

The Granite Embedding Multilingual R1 models were built on XLM-RoBERTa encoders with a 512-token context window. The R2 generation is a ground-up rebuild:

Granite Embedding R2 Architecture

ModernBERT is a recent encoder architecture that revisits the original BERT design with techniques from the last five years of transformer research. The shift brings several practical benefits: alternating local and global attention reduces computation on long sequences, significantly improving throughput; rotary position embeddings allow the 32K context window without the positional interpolation hacks that plague older architectures; and Flash Attention 2 support speeds up encoding on modern GPUs.

The new multilingual tokenizers are worth highlighting. Rather than reusing XLM-RoBERTa's 250K-token vocabulary, we adopted existing tokenizers with strong multilingual and code coverage. The 311M model uses the Gemma 3 tokenizer (262K tokens); the 97M model starts from the GPT-OSS tokenizer and prunes it down to a compact 180K-token vocabulary that preserves broad multilingual coverage while reducing the embedding table's parameter footprint. Tokenizer efficiency matters more than people realize — a 32K-token window sounds impressive until your tokenizer burns half of it encoding a single paragraph of Thai.

Training the Full-Size 311M Model

The 311M model is a 22-layer ModernBERT encoder with a 262K-token multilingual vocabulary, trained through a multi-stage pipeline:

  1. Knowledge distillation: The model learns from multiple teacher models simultaneously. The teachers are Granite 3.3 Instruct and Mistral v0.2 Instruct decoder models, further finetuned for text embeddings, which transfer retrieval-specific knowledge into the 311M encoder architecture.
  2. Contrastive fine-tuning: Standard contrastive training on multilingual retrieval pairs — queries matched with relevant and hard-negative passages across 52 languages and code — sharpens the model's ability to distinguish relevant from irrelevant results.
  3. Model merging: After training, we merge checkpoints from different training stages and configurations. This combines the strengths of models optimized for different objectives (e.g., multilingual breadth vs. English depth) into a single set of weights without additional training compute.
  4. Matryoshka Representation Learning: The model is trained with Matryoshka objectives so that its 768-dimensional embeddings can be truncated to 512, 384, 256, or 128 dimensions with minimal quality loss (see Matryoshka Embeddings below).

The result is a model that scores 65.2 on MTEB Multilingual Retrieval and 56.3 on the overall average — a +14.5 point average gain over its R1 predecessor.
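
For intuition, here is a minimal sketch of how steps 2 and 4 can be expressed with off-the-shelf sentence-transformers building blocks, wrapping a contrastive loss in a Matryoshka loss. The toy pairs, batch size, and hyperparameters are illustrative placeholders, not the actual Granite training recipe:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")

# Toy (query, relevant passage) pairs; other passages in the batch act as in-batch negatives
train_examples = [
    InputExample(texts=["What is the tallest mountain in Japan?",
                        "Mount Fuji is Japan's highest peak."]),
    InputExample(texts=["Wer hat das Lied Achy Breaky Heart geschrieben?",
                        "Achy Breaky Heart is a country song written by Don Von Tress."]),
]
loader = DataLoader(train_examples, shuffle=True, batch_size=2)

contrastive = losses.MultipleNegativesRankingLoss(model)   # step 2: contrastive objective
train_loss = losses.MatryoshkaLoss(                        # step 4: apply it at every truncation
    model, contrastive, matryoshka_dims=[768, 512, 384, 256, 128]
)

model.fit(train_objectives=[(loader, train_loss)], epochs=1, warmup_steps=0)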

Building the Compact 97M Multilingual Model

The 97M model is trained through a combination of vocabulary selection and knowledge distillation:

  1. Vocabulary selection: The starting tokenizer vocabulary is pruned to a compact 180K-token vocabulary that preserves broad multilingual coverage while substantially cutting the size of the embedding table.
  2. Knowledge distillation: The pruned model is then finetuned with knowledge distillation from multiple teacher models (including Granite 4.1 8B and Mistral Instruct decoder-based teachers) and contrastive training to improve retrieval quality.

This approach transfers retrieval-specific knowledge from multiple strong teachers while reducing the parameter count without sacrificing language coverage. The result is a highly efficient compact model: it scores 60.3 on MTEB Multilingual Retrieval vs. 65.2 for the full-size model while being approximately 3x smaller.
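
Conceptually, the vocabulary-selection step can be pictured as frequency-based pruning: score tokens on a representative multilingual sample, keep the most frequent ones plus special tokens, and slice the matching rows out of the embedding table. The sketch below is a simplified illustration under that assumption; the corpus, helper name, and the final remapping comment are placeholders rather than the published procedure:

from collections import Counter

def select_vocabulary(tokenizer, corpus, target_size):
    """Pick the target_size most frequent token ids in corpus, always keeping special tokens."""
    counts = Counter()
    for text in corpus:
        counts.update(tokenizer(text)["input_ids"])
    keep = set(tokenizer.all_special_ids)
    for token_id, _ in counts.most_common():
        if len(keep) >= target_size:
            break
        keep.add(token_id)
    return sorted(keep)

# kept_ids = select_vocabulary(source_tokenizer, multilingual_sample, target_size=180_000)
# The embedding matrix is then reduced to these rows (and the tokenizer remapped accordingly),
# e.g. new_embeddings = old_embeddings[kept_ids]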

Benchmark Results

Multilingual Retrieval

Performance across the main benchmark suite, sorted by model size (the new Granite R2 models are listed last). Scores are averages across tasks within each benchmark (higher is better):

| Model | Params | Active Params | Embed Dim | MTEB Multilingual Retrieval (18) | Code (12) | English Retrieval (10) | LongEmbed (6) | RaR-b (17) |
|---|---|---|---|---|---|---|---|---|
| F2LLM-v2-80M | 80M | 32M | 320 | 50.1 | 68.0 | 47.5 | 31.7 | 17.9 |
| multilingual-e5-small | 118M | 22M | 384 | 50.9 | 53.5 | 46.5 | 38.8 | 20.3 |
| granite-embedding-107m-multilingual (R1) | 107M | 11M | 384 | 48.1 | 40.7 | 47.9 | 34.3 | 17.1 |
| paraphrase-multilingual-MiniLM-L12-v2 | 118M | 22M | 384 | 36.6 | 23.5 | 35.9 | 20.9 | 10.9 |
| jina-embeddings-v5-text-nano | 212M | 113M | 768 | 63.3 | 71.2 | 58.8 | 63.6 | 25.2 |
| harrier-oss-v1-270m | 268M | 100M | 640 | 66.4 | 62.4 | 52.1 | 64.9 | 32.9 |
| multilingual-e5-base | 278M | 86M | 768 | 52.7 | 52.6 | 49.0 | 40.5 | 23.4 |
| granite-embedding-278m-multilingual (R1) | 278M | 86M | 768 | 52.2 | 48.5 | 51.5 | 37.7 | 18.9 |
| embeddinggemma-300m | 308M | 106M | 768 | 62.5 | 68.7 | 54.6 | 55.4 | 26.1 |
| gte-multilingual-base | 305M | 113M | 768 | 57.2 | 57.5 | 50.8 | 62.1 | 19.0 |
| snowflake-arctic-embed-m-v2.0 | 305M | 113M | 768 | 54.8 | 55.2 | 58.4 | 55.4 | 23.3 |
| multilingual-e5-large | 560M | 304M | 1024 | 53.7 | 55.8 | 51.5 | 40.4 | 25.4 |
| text-embedding-3-small (OpenAI, API only) | – | – | 1536 | 50.7 | 53.8 | 53.6 | 23.2 | – |
| granite-embedding-97m-multilingual-r2 | 97M | 28M | 384 | 60.3 | 60.4 | 50.1 | 65.6 | 24.9 |
| granite-embedding-311m-multilingual-r2 | 311M | 110M | 768 | 65.2 (#2) | 63.8 (#3) | 52.6 (#5) | 71.7 (#1) | 28.0 (#2) |

A few things stand out:

  • The 97M R2 model beats multilingual-e5-base and gte-multilingual-base (~300M parameter models) on average and on most individual benchmarks, despite being roughly 3x smaller.
  • paraphrase-multilingual-MiniLM-L12-v2 — a widely-used framework default — scores 36.6, a full 23.7 points behind the 97M R2 model, which is also smaller (97M vs. 118M parameters) with the same 384-dimensional output.
  • LongEmbed is the biggest R1-to-R2 gain: +31.3 points for the 97M model, +34.0 for the 311M. This is the direct payoff of the 32K context window — R1's 512-token limit meant your legal contract was being judged by its first page. Many practical multilingual workloads involve long documents (legal contracts, technical manuals, research papers, multi-page reports) that R1 simply could not see in full.
  • Code retrieval improves dramatically: +19.7 (97M) and +15.3 (311M) over R1, reflecting the new code training set, larger context window, and better training methodology.
  • In the broader competitive field, harrier-oss-v1-270m leads on MTEB Multilingual Retrieval (66.4) and RaR-b (32.9), while jina-embeddings-v5-text-nano leads on Code (71.2) and English Retrieval (58.8). The 311M Granite model is competitive on average (56.3) and leads on LongEmbed (71.7), while offering substantially higher encoding throughput than jina-embeddings-v5-text-nano (see the speed comparison below).

Speed and Throughput

Encoding speed matters for production workloads, especially when you're indexing millions of documents or need low-latency query encoding. We measured latency and throughput on a single NVIDIA H100 GPU using 512-token chunks:

Speed vs. Quality: 97M model matches 300M-class quality at 3x the parameter efficiency

The 97M model encodes over 2,500 documents per second — comparable throughput to multilingual-e5-small — while delivering substantially higher retrieval quality. The 311M model, at ~1,800 docs/sec, outperforms jina-embeddings-v5-text-nano on retrieval quality (65.2 vs. 63.3) at over 5.5x the encoding speed (note: speed numbers were measured with the latest transformers release, which shows a speed regression relative to version 4.57 for both the Jina and Granite models; see our technical report for details). harrier-oss-v1-270m offers the best combination of speed and retrieval score among the competitors listed here.
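
To sanity-check throughput on your own hardware, a rough measurement looks like the following; the batch size, document length, and document count are placeholders rather than the exact benchmark configuration:

import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2", device="cuda")
docs = ["lorem ipsum dolor sit amet " * 100] * 2048   # roughly 512-token chunks

model.encode(docs[:64], batch_size=64)                # warm-up pass
start = time.perf_counter()
model.encode(docs, batch_size=128, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"{len(docs) / elapsed:.0f} docs/sec")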

Matryoshka Embeddings (311M)

The 311M model supports Matryoshka Representation Learning, which lets you truncate embeddings from the full 768 dimensions down to 512, 384, 256, or 128 with graceful quality degradation. This is useful when storage, memory, or similarity-computation cost is a concern — a 256-dimensional embedding takes one-third the storage of a 768-dimensional one, and cosine similarity is proportionally cheaper to compute.

Here's how retrieval quality holds up across embedding dimensions:

Matryoshka embeddings: graceful quality degradation across truncated dimensions

The quality loss from dimension reduction is remarkably small. Cutting from 768 to 256 dimensions — a 3x reduction in storage and similarity-computation cost — drops MTEB Multilingual Retrieval by just 0.5 points (65.2 → 64.7) and Code Retrieval by 0.5 points (63.9 → 63.4). Even at 128 dimensions (a 6x reduction), the model still scores 63.7 on MTEB Multilingual Retrieval and 62.3 on Code — retaining over 97% of its full-dimension performance. In practice, this means you can substantially reduce your index size and search latency with minimal impact on result quality. (Note: results in the figure above were evaluated with a context length of 1024 for English and Multilingual Retrieval, and 8192 for Code.)

For comparison, the 311M model truncated to 384 dimensions (the same dimensionality as the 97M model's native output) still outperforms the 97M model across all three benchmarks. If you need 384-dimensional embeddings and can afford the 311M model's encoding cost, Matryoshka truncation is the stronger option.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-311m-multilingual-r2")

# Full 768-dimensional embeddings
full = model.encode(["example text"])
print(full.shape)  # (1, 768)

# Truncated to 384 dimensions
small = model.encode(["example text"], truncate_dim=384)
print(small.shape)  # (1, 384)

The 97M model does not support Matryoshka — 384 dimensions is already compact.

Cross-lingual Retrieval

Average performance on cross-lingual tasks within MTEB Retrieval. Belebele measures cross-lingual passage matching across 122 languages; MLQA measures extractive cross-lingual question answering retrieval across 7 languages.

| Model | Belebele Retrieval | MLQA Retrieval |
|---|---|---|
| granite-embedding-107m-multilingual (R1) | 55.1 | 60.5 |
| granite-embedding-278m-multilingual (R1) | 62.2 | 63.0 |
| granite-embedding-97m-multilingual-r2 | 52.9 | 60.5 |
| granite-embedding-311m-multilingual-r2 | 66.5 | 67.1 |

The 311M R2 model gains +4.3 on Belebele and +4.1 on MLQA over its R1 predecessor, showing improved cross-lingual transfer at the larger scale across both benchmarks.

The 97M R2 model scores lower on Belebele (52.9 vs 55.1, −2.2) while matching its R1 predecessor on MLQA (60.5). The Belebele gap is a tradeoff inherent in the pruning and vocabulary reduction process — the R2 model's training prioritized the broader 18-language MTEB Multilingual Retrieval set (where it gains +12.2 over R1) and long-document retrieval (+31.3), while the smaller vocabulary (180K vs. 250K tokens) and reduced layer count (12 vs. 22) affect narrow cross-lingual transfer tasks. If cross-lingual transfer across many language pairs is your primary use case, the full-size 311M model is the better choice.

Deployment Options

Both models ship with multiple deployment paths for production use. Install the core library with:

pip install sentence-transformers

Sentence Transformers (recommended for most users):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

queries = [
    "What is the tallest mountain in Japan?",          # English
    "Wer hat das Lied Achy Breaky Heart geschrieben?", # German
    "ドイツの首都はどこですか?",                            # Japanese
]

passages = [
    "富士山は、静岡県と山梨県にまたがる活火山で、標高3776.12 mで日本最高峰の独立峰である。",  # Japanese
    "Achy Breaky Heart is a country song written by Don Von Tress.",                        # English
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland.",                # German
]

q_emb = model.encode(queries)
p_emb = model.encode(passages)
print(util.cos_sim(q_emb, p_emb))
# Each query scores highest against its matching passage — across languages

LangChain (pip install langchain-huggingface):

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"
)

docs = embeddings.embed_documents([
    "富士山は日本最高峰の独立峰です。",
    "Mount Fuji is Japan's highest peak.",
])
query = embeddings.embed_query("What is Japan's tallest mountain?")
# Drop-in replacement anywhere LangChain accepts an Embeddings object

LlamaIndex (pip install llama-index-embeddings-huggingface):

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

embed_model = HuggingFaceEmbedding(
    model_name="ibm-granite/granite-embedding-97m-multilingual-r2"
)
Settings.embed_model = embed_model  # applies globally to any index or pipeline

Haystack (pip install sentence-transformers haystack-ai):

from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.dataclasses import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

doc_embedder = SentenceTransformersDocumentEmbedder(
    model="ibm-granite/granite-embedding-97m-multilingual-r2"
)
query_embedder = SentenceTransformersTextEmbedder(
    model="ibm-granite/granite-embedding-97m-multilingual-r2"
)
doc_embedder.warm_up()
query_embedder.warm_up()

# Embed and index documents
document_store = InMemoryDocumentStore()
result_docs = doc_embedder.run(documents=[
    Document(content="富士山は日本最高峰の独立峰です。"),
    Document(content="Mount Fuji is Japan's highest peak."),
    Document(content="Achy Breaky Heart is a country song written by Don Von Tress."),
    Document(content="Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland."),
])
document_store.write_documents(result_docs["documents"])

# Embed query and retrieve
result_query = query_embedder.run(text="What is Japan's tallest mountain?")
retriever = InMemoryEmbeddingRetriever(document_store=document_store)
results = retriever.run(query_embedding=result_query["embedding"], top_k=2)
for doc in results["documents"]:
    print(f"{doc.score:.3f}  {doc.content}")
# 0.961  Mount Fuji is Japan's highest peak.
# 0.913  富士山は日本最高峰の独立峰です。

Milvus (pip install pymilvus sentence-transformers):

from pymilvus import MilvusClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ibm-granite/granite-embedding-97m-multilingual-r2")

# Use "./milvus.db" for local persistence or a server URI for production
client = MilvusClient(":memory:")
client.create_collection(collection_name="multilingual_docs", dimension=384)

docs = [
    "富士山は日本最高峰の独立峰です。",
    "Mount Fuji is Japan's highest peak.",
    "Achy Breaky Heart is a country song written by Don Von Tress.",
    "Berlin ist die Hauptstadt und ein Land der Bundesrepublik Deutschland.",
]
embeddings = model.encode(docs).tolist()
client.insert(
    collection_name="multilingual_docs",
    data=[{"id": i, "vector": emb, "text": doc} for i, (emb, doc) in enumerate(zip(embeddings, docs))],
)

query_emb = model.encode(["What is Japan's tallest mountain?"]).tolist()
results = client.search(
    collection_name="multilingual_docs",
    data=query_emb,
    limit=2,
    output_fields=["text"],
)
for hit in results[0]:
    print(f"{hit['distance']:.3f}  {hit['entity']['text']}")
# 0.961  Mount Fuji is Japan's highest peak.
# 0.913  富士山は日本最高峰の独立峰です。

Both models also ship with pre-converted ONNX and OpenVINO weights for optimized CPU/accelerator inference, work as embedding endpoints via vLLM (vllm serve ... --task embed), and can be converted to GGUF for Ollama using llama.cpp. See the model cards for full deployment examples.
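
As one example, recent sentence-transformers releases can load the pre-converted ONNX weights directly through the backend argument (a minimal sketch; it assumes the ONNX extra is installed via pip install "sentence-transformers[onnx]"):

from sentence_transformers import SentenceTransformer

# Loads the repository's ONNX weights for CPU-oriented inference
onnx_model = SentenceTransformer(
    "ibm-granite/granite-embedding-97m-multilingual-r2",
    backend="onnx",
)
print(onnx_model.encode(["CPU-friendly multilingual embeddings"]).shape)  # (1, 384)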

For Framework Integrators

If you maintain an embedding framework, vector store, or RAG pipeline library and are evaluating these models as a default, here's what you need to know:

  • License: Apache 2.0, trained without MS-MARCO
  • Drop-in behavior: No task-specific instruction prefix required — behaves like all-MiniLM-L6-v2 at the API level. Existing code that calls .encode() works unchanged.
  • Dimensionality: 384-dimensional output (97M) and 768-dimensional output (311M), matching the most common existing defaults. No index migration required.
  • Model size: The 97M model's weights are 195 MB (safetensors) — less than half the size of paraphrase-multilingual-MiniLM-L12-v2 (471 MB), the most common multilingual default. The quantized ONNX weights are just 98 MB, comparable to all-MiniLM-L6-v2 (91 MB) while covering 200+ languages.
  • CPU-friendly: Ships with ONNX and OpenVINO weights for optimized CPU inference. No GPU dependency for a getting-started tutorial.
  • Multilingual by default: If your current default is English-only, this is a one-line swap that gives every user in your community support for 200+ languages — without touching their code.
  • Stable identifier: ibm-granite/granite-embedding-97m-multilingual-r2 on Hugging Face, maintained by IBM under the Granite model family.

To discuss adopting these models as a default in your project, open an issue at ibm-granite/granite-embedding-models.

Which Model Should You Use?

These two multilingual models are part of the broader Granite Embedding R2 family, which also includes two high-performing English-focused models: granite-embedding-english-r2 (149M parameters) and granite-embedding-small-english-r2 (47M parameters). If your data is predominantly English, the English models offer higher retrieval quality on English benchmarks at a smaller footprint, since they don't need to allocate capacity across 200+ languages.

| If you need... | Use |
|---|---|
| Best multilingual retrieval quality | granite-embedding-311m-multilingual-r2 |
| Flexible embedding dimensions (storage/speed tradeoff) | granite-embedding-311m-multilingual-r2 (Matryoshka) |
| Maximum throughput / edge deployment / low latency | granite-embedding-97m-multilingual-r2 |
| Best cross-lingual transfer across many language pairs | granite-embedding-311m-multilingual-r2 |
| Predominantly English data | granite-embedding-english-r2 or granite-embedding-small-english-r2 |

Try The Models

Both models are available now on Hugging Face under the IBM Granite Embedding collection:

  • ibm-granite/granite-embedding-97m-multilingual-r2
  • ibm-granite/granite-embedding-311m-multilingual-r2

You will also be able to try the models interactively (on CPU) via a Granite Embedding demo on Hugging Face Spaces (coming soon), or run the full examples notebook in Google Colab:

Open In Colab

Our detailed technical report, the Granite Multilingual Embedding R2 report, covers the full training methodology, per-language evaluations, and pruning ablations. For questions, feedback, or issues, visit ibm-granite/granite-embedding-models on GitHub.

Framework maintainers: If you'd like to adopt these models as a default in your project, open an issue at ibm-granite/granite-embedding-models — we're happy to help with integration, testing, and any questions about licensing or deployment.

Give them a try, and if the embeddings spark joy, smash that ❤️ button on Hugging Face. Our models have feelings too, and every +1 keeps them warm at night.
