How are you handling training data when public datasets don't match your use case? [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
Public datasets on HF or Kaggle can be too generic, from the wrong domain, in the wrong schema, outdated, or simply too small for a model to generalize properly. Collecting real-world proprietary data takes months. What do people actually do? From what I have seen, the options tend to be:
- Ship with what you have and accept degraded performance
- Spend weeks scraping and cleaning, which eats engineering time
- Augmentation techniques like SMOTE or noise injection, which help at the margins but do not solve domain specificity
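To make the augmentation option concrete, here is a minimal sketch of SMOTE-style interpolation in plain Python. It is a simplified illustration, not the reference algorithm or any library's implementation; the function name `smote_like`, the toy points, and the parameter values are all made up for the example.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical toy data: four minority-class points with two features each.
minority = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.2), (1.1, 2.1)]
new_points = smote_like(minority, n_new=5)
```

Every synthetic point lies on a segment between two existing points, so it never leaves the range the data already covers. That is one way to see why this kind of augmentation adds volume but not domain coverage.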
I am working on a project that approaches this differently: sourcing permissively licensed real-world data, curating it to a company's specified schema, then running synthetic expansion to reach the volume and edge-case coverage the model actually needs. Every output includes a fidelity report showing statistical alignment between the synthetic output and the source distribution.
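As a rough illustration of what one entry in a fidelity report could look like, the sketch below computes a two-sample Kolmogorov-Smirnov statistic per numeric column. This is an assumption for the sake of example: the post does not say which metrics the report uses, and the column names, distributions, and the `fidelity_report` helper are hypothetical.

```python
import bisect
import random

def ks_statistic(source, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a, b = sorted(source), sorted(synthetic)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def fidelity_report(source_cols, synthetic_cols):
    """Map each column name to its KS statistic (hypothetical report format)."""
    return {
        name: ks_statistic(source_cols[name], synthetic_cols[name])
        for name in source_cols
    }

# Made-up data: "age" is synthesized well, "income" is shifted on purpose.
rng = random.Random(0)
source_cols = {
    "age": [rng.gauss(40, 10) for _ in range(500)],
    "income": [rng.gauss(50, 10) for _ in range(500)],
}
synthetic_cols = {
    "age": [rng.gauss(40, 10) for _ in range(500)],
    "income": [rng.gauss(70, 10) for _ in range(500)],
}
report = fidelity_report(source_cols, synthetic_cols)
```

A per-column check like this only compares marginal distributions. It says nothing about correlations between columns, so a real report would presumably need joint or pairwise measures as well.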
Before going further with it, I genuinely want to know whether this is a pain people feel acutely or whether most teams have found workarounds that make something like this unnecessary.
If you are hitting a data wall on something you are building right now, I would love to hear what the specific bottleneck looks like.
What has worked for you?