World's Biggest Chat Title Dataset From SupraLabs
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
If you searched "Chat title dataset" on Hugging Face a few days ago, the biggest chat title dataset you would find was "ogrnz/chat-titles". Recently at SupraLabs we curated a 115K filtered dataset, which raises the record for the biggest chat title dataset from 10k samples to 115k samples!

We've released a set of chat title generation datasets that may be useful for instruction tuning, classification-style title generation, or benchmarking small models. The release includes both a filtered and an unfiltered version:

- Filtered: `SupraLabs/chat-titles-filtered-115K`
- Unfiltered: `SupraLabs/chat-titles-unfiltered-150K`
- Legacy release: `SupraLabs/chat-titles-12K`

The filtered version is the one we generally recommend for most training runs, while the unfiltered version is provided for anyone who prefers to apply their own cleaning and filtering pipeline.

We're interested in hearing feedback from anyone who experiments with the datasets, especially regarding data quality, filtering approaches, and title generation performance across different model sizes. Questions, suggestions, and criticism are all welcome.
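For anyone who wants to run their own cleaning on the unfiltered release, here is a minimal sketch of what such a pass could look like. It is not the filtering SupraLabs used. The field names `conversation` and `title`, the word-count bounds, and the helper names are all assumptions for illustration, so check the actual dataset columns before reusing it.

```python
import re

# Hypothetical field names: the real columns in the SupraLabs datasets may differ.
CHAT_KEY = "conversation"
TITLE_KEY = "title"


def clean_title(title: str) -> str:
    """Collapse runs of whitespace and strip surrounding quote characters."""
    title = re.sub(r"\s+", " ", title).strip()
    return title.strip("\"'").strip()


def filter_rows(rows, min_words=2, max_words=10):
    """Keep rows with a non-empty chat and a title inside the word-count bounds.

    Duplicates are detected case-insensitively on the (chat, title) pair.
    The bounds are arbitrary defaults, not values taken from the release.
    """
    seen = set()
    kept = []
    for row in rows:
        title = clean_title(row.get(TITLE_KEY) or "")
        chat = (row.get(CHAT_KEY) or "").strip()
        n_words = len(title.split())
        if not chat or not (min_words <= n_words <= max_words):
            continue
        key = (chat.lower(), title.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append({CHAT_KEY: chat, TITLE_KEY: title})
    return kept


# Tiny made-up sample, only to show the behaviour of the filter.
sample = [
    {"conversation": "How do I reverse a list in Python?",
     "title": "  \"Reversing a Python List\" "},
    {"conversation": "How do I reverse a list in Python?",
     "title": "Reversing a Python list"},          # duplicate after normalising
    {"conversation": "Tell me a joke", "title": "Joke"},  # title too short
    {"conversation": "", "title": "Empty chat example"},  # no chat text
]

cleaned = filter_rows(sample)
for row in cleaned:
    print(row[TITLE_KEY])
```

The rows could come from the Hugging Face `datasets` library (for example via `load_dataset("SupraLabs/chat-titles-unfiltered-150K")`), with `CHAT_KEY` and `TITLE_KEY` adjusted to whatever columns the dataset actually exposes.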