A new dataset with more than 100M high-quality, curated images, with captions and metadata! [P]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Hello everyone.
The new dataset is named MONET. It is Apache 2.0-licensed and available on Hugging Face:
https://huggingface.co/datasets/jasperai/monet
MONET is an open, Apache 2.0-licensed image–text dataset. It was built from 2.9 billion images and refined to 104.9 million high-quality samples.
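For anyone who wants to peek at a few records without downloading everything, here is a minimal sketch using the Hugging Face `datasets` library in streaming mode. Only the repo id comes from the link above; the `train` split name is an assumption, and the helper `first_samples` and its `loader` parameter are hypothetical names made up for this example. Check the dataset card for the actual splits and columns.

```python
from itertools import islice
from typing import Any, Callable, Dict, List, Optional

REPO_ID = "jasperai/monet"  # taken from the Hugging Face URL in the post


def first_samples(
    n: int,
    repo_id: str = REPO_ID,
    split: str = "train",  # assumption: the dataset card may use other split names
    loader: Optional[Callable[..., Any]] = None,
) -> List[Dict[str, Any]]:
    """Return the first n records, streaming rather than downloading the full dataset."""
    if loader is None:
        # Imported lazily so the helper can be exercised offline with a stub loader.
        from datasets import load_dataset  # pip install datasets

        loader = load_dataset
    # streaming=True yields records lazily as an iterable dataset.
    stream = loader(repo_id, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling `first_samples(3)` and printing the keys of each returned record should show which columns (image, caption, metadata fields) are actually present, since the post does not list them.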
We are also publishing a paper that explains how the dataset was created, if you are curious, along with three companion projects:
- A UMAP to visualize the distribution
- A retrieval tool for text or image search
- A codebase to train a text-to-image (T2I) model on MONET
Hope this will be useful!