Hugging Face Dataset Lineage Explorer
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
As Hugging Face's Machine Learning Librarian, I am probably more obsessed with metadata than most. One field in the spec for HF dataset card READMEs is source_datasets. It is very rarely used, so it's quite hard to know how different datasets relate to each other.

To help with this, I did a bit of work with Claude Code to explore whether it's possible to detect which datasets are derivatives of others, i.e. translations, cleaned-up versions, etc.

A few things from the analysis:

Take these with a pinch of salt: we didn't look at all datasets, so the diversity is likely much higher once you get into less-used datasets (and obviously this doesn't include private datasets).

I also made a Space to explore some of these results: https://huggingface.co/spaces/davanstrien/dataset-lineage-explorer
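For anyone who wants to see what the source_datasets field buys you when it is filled in, here is a rough sketch of reading it from a card's YAML front matter and inverting it into a parent-to-derivatives map. This is not the method used in the analysis or the Space, just an illustration. The dataset IDs and the helper names `read_source_datasets` and `build_lineage` are made up for the example, and the parser only handles the simple list forms of this one field rather than full YAML.

```python
import re
from collections import defaultdict


def read_source_datasets(readme_text):
    """Return the source_datasets entries declared in a card's YAML front matter.

    Only handles a block list, a flow list, or a single scalar value.
    It is not a general YAML parser.
    """
    # Front matter is the block between the leading '---' line and the next one.
    match = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", readme_text, re.DOTALL)
    if not match:
        return []

    sources = []
    in_field = False
    for line in match.group(1).splitlines():
        if re.match(r"^source_datasets\s*:", line):
            inline = line.split(":", 1)[1].strip()
            if inline.startswith("[") and inline.endswith("]"):
                # Flow style: source_datasets: [a, b]
                parts = inline[1:-1].split(",")
                sources = [p.strip().strip("'\"") for p in parts if p.strip()]
            elif inline:
                # Single scalar: source_datasets: a
                sources = [inline.strip("'\"")]
            else:
                # Block style: items follow on the next lines as "- a"
                in_field = True
            continue
        if in_field:
            item = re.match(r"^\s*-\s+(.*)$", line)
            if item:
                sources.append(item.group(1).strip().strip("'\""))
            elif line.strip():
                in_field = False  # the next key ends the list
    return sources


def build_lineage(cards):
    """Map each declared source to the datasets that list it as a source."""
    children = defaultdict(list)
    for dataset_id, readme in cards.items():
        for parent in read_source_datasets(readme):
            children[parent].append(dataset_id)
    return dict(children)


# Hypothetical cards, purely for illustration.
cards = {
    "example-org/base-corpus-fr": (
        "---\nlanguage:\n- fr\nsource_datasets:\n"
        "- example-org/base-corpus\nlicense: mit\n---\n# Card\n"
    ),
    "example-org/base-corpus-clean": (
        "---\nsource_datasets: [example-org/base-corpus]\n---\n"
    ),
    "example-org/base-corpus": "---\nlicense: mit\n---\n",
}

for parent, derived in build_lineage(cards).items():
    print(parent, "->", derived)
```

Of course this only works for cards that actually declare the field, which is exactly the gap described above.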