r/LocalLLaMA · · 1 min read

Hugging Face Dataset Lineage Explorer

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

As Hugging Face's Machine Learning Librarian, I am probably more obsessed with metadata than most. One field in the spec for HF dataset card READMEs is source_datasets, which is meant to record the datasets a dataset was built from. It is very rarely filled in, so it's quite hard to know how different datasets relate to each other.
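For anyone who hasn't looked at the field: source_datasets sits in the YAML front matter at the top of a dataset card's README. Below is a minimal sketch of reading it with only the standard library. The function name `read_source_datasets` and the sample card text are made up for illustration, and the hand-rolled parsing only covers the simple layouts shown in the comments; a real YAML parser is the safer choice in practice.

```python
import re


def read_source_datasets(readme_text):
    """Return the source_datasets entries found in a card's YAML front matter.

    Simplified on purpose: handles a block list ("- item" lines), a single
    inline value, or an inline "[a, b]" list. Not a general YAML parser.
    """
    # Front matter is the block between the leading pair of '---' lines.
    match = re.match(r"\A---[ \t]*\n(.*?)\n---[ \t]*(?:\n|\Z)", readme_text, re.DOTALL)
    if not match:
        return []

    sources = []
    in_field = False
    for line in match.group(1).splitlines():
        if line.startswith("source_datasets:"):
            inline = line.split(":", 1)[1].strip()
            if inline:
                # Inline forms: `source_datasets: original` or `[a, b]`.
                parts = inline.strip("[]").split(",")
                return [p.strip().strip("'\"") for p in parts if p.strip()]
            in_field = True
        elif in_field:
            stripped = line.strip()
            if stripped.startswith("- "):
                sources.append(stripped[2:].strip().strip("'\""))
            elif stripped:
                break  # reached the next top-level key
    return sources


# Made-up card text, for illustration only.
sample_card = """---
license: mit
source_datasets:
- tatsu-lab/alpaca
language:
- en
---
# Example dataset
"""

print(read_source_datasets(sample_card))
```

A card with no front matter, or with the field left out, returns an empty list, which is the common case described above.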

To help with this, I did a bit of work with Claude Code to explore whether it's possible to detect which datasets are derivatives of others, e.g. translations, cleaned-up versions, etc.
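To give a feel for what this kind of detection can look like, here is a rough sketch that guesses a relationship from the repo name alone. This is not the method used for the analysis, which isn't described in this post; the marker word lists, the function name `classify_derivative`, and the example repo IDs are all hypothetical.

```python
import re

# Hypothetical marker words, chosen for illustration only.
CLEANED_MARKERS = {"cleaned", "clean", "filtered", "dedup", "deduped"}
# A small, incomplete sample of language codes and names.
LANGUAGE_MARKERS = {
    "de", "fr", "es", "ja", "zh", "ko",
    "german", "french", "spanish", "japanese", "chinese", "korean",
}


def classify_derivative(repo_id, base_name):
    """Guess how repo_id relates to base_name using only the repo name.

    Returns "cleaned", "translation", "other", or None when the base
    name does not appear as a token in the repo name.
    """
    name = repo_id.split("/")[-1].lower()
    tokens = [t for t in re.split(r"[-_.\s]+", name) if t]
    base = base_name.lower()
    if base not in tokens:
        return None
    others = set(tokens) - {base}
    if others & CLEANED_MARKERS:
        return "cleaned"
    if others & LANGUAGE_MARKERS:
        return "translation"
    return "other"


# Hypothetical repo IDs, not real datasets.
for repo in ["example-org/alpaca-cleaned", "example-org/alpaca_de", "example-org/squad-v2"]:
    print(repo, classify_derivative(repo, "alpaca"))
```

Name matching like this will miss derivatives that were renamed and will wrongly flag unrelated datasets that happen to share a word, so it is at best a first pass before looking at the data itself.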

A few things from the analysis:
- alpaca-style datasets have hundreds of derivatives
- "cleaned" variants of the same source proliferate across orgs
- translations and language-filtered subsets are a huge chunk of the long tail

Take these with a pinch of salt: we didn't look at all datasets, so the diversity is likely much higher once you get into less-used datasets (and obviously this doesn't include private datasets).

Also made a Space to explore some of these results: https://huggingface.co/spaces/davanstrien/dataset-lineage-explorer

Alpaca children

submitted by /u/dvanstrien