Hugging Face Daily Papers · 4 min read

Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

X thread: https://x.com/mseyed/status/2059504005387284629
Docs: https://docs.cloud.google.com/gemini-enterprise-agent-platform/models/gemini/embedding-2
Project page: https://ai.google.dev/gemini-api/docs/embeddings

arxiv:2605.27295

Published on May 26, 2026 · Submitted by Niels Rogge on May 27, 2026

Organization: Google

Authors: Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini

Abstract

AI-generated summary: Gemini Embedding 2 is a multimodal embedding model that generates unified representations for video, audio, image, and text data, achieving superior performance across diverse retrieval tasks and demonstrating strong zero-shot capabilities across specialized domains.

We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all of these modalities, and these embeddings generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks, including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. Our embedding model scores 62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code, surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
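The abstract reports retrieval quality as R@1 and NDCG@10. As a rough illustration of what those metrics measure, the sketch below ranks a toy corpus by cosine similarity in a shared embedding space and scores the ranking. It is not the paper's evaluation code: the helper names (`cosine`, `rank`, `recall_at_k`, `ndcg_at_k`), the two-dimensional toy vectors, and the choice of cosine similarity are all assumptions made for this example, and the metric definitions follow common conventions (a per-query hit indicator for R@k, a log2 discount for NDCG) that the paper may or may not use exactly.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, corpus):
    """Return corpus indices ordered from most to least similar to the query."""
    scores = [cosine(query, doc) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranking, relevant, k):
    """Hit indicator for one query: 1.0 if a relevant index is in the top k."""
    return 1.0 if any(i in relevant for i in ranking[:k]) else 0.0


def ndcg_at_k(ranking, gains, k):
    """NDCG@k for one query; `gains` maps corpus index to a relevance gain."""
    dcg = sum(
        gains.get(idx, 0.0) / math.log2(pos + 2)
        for pos, idx in enumerate(ranking[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy vectors standing in for embeddings of mixed-modality items (hypothetical).
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.9, 0.1]

ranking = rank(query, corpus)
hit_at_1 = recall_at_k(ranking, relevant={0}, k=1)
ndcg_10 = ndcg_at_k(ranking, gains={2: 1.0}, k=10)
```

Benchmark figures such as 62.9 R@1 are typically these per-query values averaged over all queries and multiplied by 100.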
