Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
Published: 2026-05-26
Organization: Google
Project page: https://ai.google.dev/gemini-api/docs/embeddings
Authors: Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini
AI-generated summary
Gemini Embedding 2 is a multimodal embedding model that generates unified representations for video, audio, image, and text data, achieving superior performance across diverse retrieval tasks and demonstrating strong zero-shot capabilities across specialized domains.
Abstract
We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text inputs in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary combinations of interleaved inputs across all these modalities, and these embeddings generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks, including unimodal, cross-modal, and multimodal retrieval spanning a diverse set of tasks. Our embedding model performs strongly across a variety of tasks (62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code), surpassing the performance of specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
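The retrieval metrics quoted above (R@1 and NDCG@10) can be illustrated with a small example. The following is a minimal sketch, not the paper's evaluation code. It assumes hand-written toy vectors in place of real model outputs, cosine similarity as the scoring function (the abstract does not state which similarity is used), and binary relevance labels. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_candidates(query, candidates):
    """Return candidate indices sorted by descending similarity to the query."""
    scores = [cosine(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)

def recall_at_k(ranking, relevant, k):
    """Per-query hit indicator: 1.0 if a relevant item is in the top k, else 0.0.
    Averaging this over all queries gives the benchmark-level R@k."""
    return 1.0 if any(i in relevant for i in ranking[:k]) else 0.0

def ndcg_at_k(ranking, relevant, k):
    """Binary-relevance NDCG@k for a single query."""
    dcg = sum(1.0 / math.log2(pos + 2)
              for pos, i in enumerate(ranking[:k]) if i in relevant)
    ideal = sum(1.0 / math.log2(pos + 2)
                for pos in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

# Toy stand-ins for embeddings that live in one shared space (assumed values).
text_query = [0.9, 0.1, 0.0]
image_candidates = [
    [0.0, 1.0, 0.0],  # index 0: unrelated image
    [1.0, 0.2, 0.0],  # index 1: the matching image
    [0.5, 0.5, 0.5],  # index 2: partially related image
]
relevant = {1}

ranking = rank_candidates(text_query, image_candidates)
r1 = recall_at_k(ranking, relevant, 1)
ndcg10 = ndcg_at_k(ranking, relevant, 10)
print(ranking, r1, ndcg10)
```

Because queries and candidates share one space, the same ranking function works whether the query is text and the candidates are images, or any other modality pairing; only the vectors change.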
arXiv: arxiv.org/abs/2605.27295