What kinds of models are people training with document data? [P]
Mirrored from r/MachineLearning.
We've helped some folks with synthetic data for a number of different projects, some of them involving "document data": annotated PDFs and PNGs, such as tax forms and health forms. These are especially hard to source when they contain PII, for obvious privacy reasons. So we built an engine that constructs a simulation and then extracts annotated data from it.
We're trying to make sure our pipeline fits into a typical training pipeline, so I'm curious about your workflows. Today we output in formats consistent with FUNSD, BIO tagging, YOLO (v5 and higher), Donut, COCO, etc. Are we targeting the right formats, or are people training on something different that would need a different format or ontology?
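For anyone unfamiliar with how those formats relate, here's a minimal sketch of a FUNSD-style entity record and a naive conversion of its words to BIO tags. The field names (`text`, `label`, `box`, `words`) follow the FUNSD layout; the entity itself and the conversion helper are hypothetical, just to illustrate the shape of the data:

```python
def funsd_to_bio(entity):
    """Map a FUNSD-style entity's words to BIO tags for its label."""
    label = entity["label"].upper()
    tags = []
    for i, word in enumerate(entity["words"]):
        prefix = "B" if i == 0 else "I"  # first word begins the span
        tags.append((word["text"], f"{prefix}-{label}"))
    return tags

# Hypothetical annotation for one form field on a scanned page.
entity = {
    "text": "Date of Birth",
    "label": "question",
    "box": [57, 102, 180, 119],  # x0, y0, x1, y1 in pixels
    "words": [
        {"text": "Date",  "box": [57, 102, 92, 119]},
        {"text": "of",    "box": [96, 102, 112, 119]},
        {"text": "Birth", "box": [116, 102, 180, 119]},
    ],
}

print(funsd_to_bio(entity))
# [('Date', 'B-QUESTION'), ('of', 'I-QUESTION'), ('Birth', 'I-QUESTION')]
```

The layout formats (YOLO, COCO) carry roughly the same box information without the token-level tags, so one richly annotated source can emit all of them.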
We're also trying to figure out packaging: is a PyPI SDK useful, do people just hit the API and not care, or is it "shut up and give me a zip file"? :-)