r/MachineLearning

Best examples of ML projects with good dataset/task code abstractions? [D]


I am working on a benchmark and need to manage several interlocking components: datasets and metadata, diverse ML tasks (with varying inputs and outputs), and baseline experiments covering models, training, and evaluation. Any pointers to projects that handle these through clean, minimal data structures such as dataclasses or Pydantic models? Specifically, I want to see how others manage:

  1. Dataset Information: Representing dataset cards, metadata, and split definitions as first-class objects.
  2. Task Schemas: Defining ML tasks with specific input and output types to ensure consistency across different models.
  3. Experiment Composition: Structures that link a model and training configuration to a specific evaluation and prediction set.
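
To make the three points above concrete, here is a rough sketch of the kind of structures I have in mind, using only standard-library dataclasses. It is not taken from any existing repository: every name in it (`Split`, `DatasetInfo`, `Task`, `TrainConfig`, `Experiment`, `accuracy`, the toy parity dataset) is hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Sequence, Tuple, TypeVar

X = TypeVar("X")  # task input type
Y = TypeVar("Y")  # task output type


@dataclass(frozen=True)
class Split:
    name: str
    num_examples: int


@dataclass(frozen=True)
class DatasetInfo:
    """Dataset card, metadata, and split definitions as one object."""
    name: str
    description: str
    license: str
    splits: Tuple[Split, ...]

    def split(self, name: str) -> Split:
        for s in self.splits:
            if s.name == name:
                return s
        raise KeyError(f"{self.name} has no split named {name!r}")


@dataclass(frozen=True)
class Task(Generic[X, Y]):
    """A task fixes the input type X, the output type Y, and a metric."""
    name: str
    dataset: DatasetInfo
    metric: Callable[[Sequence[Y], Sequence[Y]], float]


@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 1e-3
    epochs: int = 10
    seed: int = 0


@dataclass(frozen=True)
class Experiment(Generic[X, Y]):
    """Links a model and training config to a task and an evaluation split."""
    task: Task[X, Y]
    model: Callable[[X], Y]
    train: TrainConfig
    eval_split: str = "test"

    def evaluate(self, inputs: Sequence[X], targets: Sequence[Y]) -> float:
        # Fail early if the requested split is not declared on the dataset.
        self.task.dataset.split(self.eval_split)
        predictions = [self.model(x) for x in inputs]
        return self.task.metric(predictions, targets)


def accuracy(predictions: Sequence[int], targets: Sequence[int]) -> float:
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy usage: a made-up parity dataset with a trivial "model".
toy = DatasetInfo(
    name="toy-parity",
    description="Integers labelled by parity (illustrative only).",
    license="CC0-1.0",
    splits=(Split("train", 8), Split("test", 4)),
)
parity: Task[int, int] = Task(name="parity", dataset=toy, metric=accuracy)
exp = Experiment(task=parity, model=lambda x: x % 2, train=TrainConfig(epochs=1))
score = exp.evaluate([1, 2, 3, 4], [1, 0, 1, 1])
print(score)  # 0.75 (the last target is deliberately mislabelled)
```

The frozen dataclasses keep configs immutable and hashable, and the `Generic[X, Y]` parameters let a static type checker flag a model whose input or output types do not match the task. Swapping these for Pydantic models would add runtime validation, at the cost of a dependency.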

If you have seen repositories that maintain these abstractions with minimal boilerplate and high type safety, please share them. I am interested in internal code organization rather than external tools like W&B or MLflow. I'm definitely aware of Cookiecutter Data Science; I'm looking specifically for data structures.

submitted by /u/LetsTacoooo