Best examples of ML projects with good dataset/task code abstractions? [D]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I am working on a benchmark and need to manage several interlocking components: datasets and metadata, diverse ML tasks (varying inputs and outputs), and baseline experiments covering models, training, and evaluations. Are there any projects that handle these through clean, minimal data structures such as dataclasses or Pydantic models? Specifically, I want to see how others manage:
- Dataset Information: Representing dataset cards, metadata, and split definitions as first-class objects.
- Task Schemas: Defining ML tasks with specific input and output types to ensure consistency across different models.
- Experiment Composition: Structures that link a model and training configuration to a specific evaluation and prediction set.
If you have seen repositories that maintain these abstractions with minimal boilerplate and high type safety, please share them. I am interested in internal code organization rather than external tools like W&B or MLflow. I am already aware of Cookiecutter Data Science; what I am looking for is data structures.
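To make the question concrete, here is a rough sketch of the three kinds of structures listed above, using only standard-library dataclasses. Every class and field name here (`SplitInfo`, `DatasetInfo`, `TaskSchema`, `TrainingConfig`, `Experiment`) is hypothetical and not taken from any existing project, and it assumes Python 3.9+ for the built-in generic annotations.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

In = TypeVar("In")
Out = TypeVar("Out")


@dataclass(frozen=True)
class SplitInfo:
    """One named split of a dataset (hypothetical structure)."""
    name: str
    num_examples: int


@dataclass(frozen=True)
class DatasetInfo:
    """Dataset card reduced to a few typed fields (hypothetical structure)."""
    name: str
    version: str
    license: str
    splits: tuple[SplitInfo, ...]

    def split(self, name: str) -> SplitInfo:
        # Look up a split by name; fail loudly if it is not defined.
        for s in self.splits:
            if s.name == name:
                return s
        raise KeyError(f"{self.name} has no split named {name!r}")


@dataclass(frozen=True)
class TaskSchema(Generic[In, Out]):
    """A task described by the types of its inputs and outputs."""
    name: str
    input_type: type[In]
    output_type: type[Out]

    def check_prediction(self, prediction: object) -> bool:
        # Runtime check that a model output matches the declared output type.
        return isinstance(prediction, self.output_type)


@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float = 1e-3
    epochs: int = 10
    seed: int = 0


@dataclass(frozen=True)
class Experiment:
    """Links a model and training config to a dataset, task, and evaluation."""
    model_name: str
    dataset: DatasetInfo
    task: TaskSchema
    training: TrainingConfig
    eval_split: str = "test"
    metrics: tuple[str, ...] = ("accuracy",)

    def __post_init__(self) -> None:
        # Validate at construction time that the evaluation split exists.
        self.dataset.split(self.eval_split)


toy = DatasetInfo(
    name="toy-sentiment",
    version="0.1",
    license="CC-BY-4.0",
    splits=(SplitInfo("train", 800), SplitInfo("test", 200)),
)
task = TaskSchema(name="binary-sentiment", input_type=str, output_type=int)
exp = Experiment(
    model_name="logreg-baseline",
    dataset=toy,
    task=task,
    training=TrainingConfig(epochs=5),
)
```

Frozen dataclasses keep each record immutable and hashable, and the `__post_init__` check means an experiment that points at a missing split fails when it is built rather than at evaluation time. A Pydantic version would presumably look similar, with validation replacing the manual checks.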