r/MachineLearning

Before we spend months processing open-source robotics datasets, tell us why this is a bad idea [D]

P.S. Not pitching anything; just trying to understand where reality differs from the narrative.

We're a couple of ML students who have mostly worked on ML/software before, but over the last few months we've been playing with VLAs and robot datasets, trying to understand where the field is heading.

After spending a few weeks downloading robotics datasets, we were surprised by how much effort went into just getting data into a usable format.

Maybe we're missing something, but it felt like every dataset had different assumptions, schemas, sensors, coordinate frames, metadata standards, and tooling.

That got us wondering:

How do robotics teams actually think about data sharing?

Do people genuinely want access to more robot data, or is the industry moving toward "collect your own data because nobody else's transfers"?

Our current (possibly very wrong) hypothesis is:

The robotics ecosystem doesn't have a data scarcity problem.

It has a data interoperability problem.

We're considering running a pretty large experiment:

Take essentially every public robot-learning dataset we can get our hands on, normalize it into a common schema, enrich it with metadata, and see how much of it is actually reusable across tasks, embodiments, and learning pipelines.
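To make "normalize it into a common schema" concrete, here is a rough sketch of what we have in mind. Everything in it is made up for illustration: the `Step` schema, the two source formats (`dataset_a`, `dataset_b`), and their field names are hypothetical and don't correspond to any real dataset. The idea is just one small adapter per source that converts units and key names into a shared record.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """One timestep in a hypothetical common schema (seconds, radians)."""
    timestamp_s: float
    joint_positions_rad: List[float]
    action: List[float]
    metadata: Dict[str, Any] = field(default_factory=dict)


def from_dataset_a(raw: Dict[str, Any]) -> Step:
    # Hypothetical source A: millisecond timestamps, joint angles in degrees.
    return Step(
        timestamp_s=raw["t_ms"] / 1000.0,
        joint_positions_rad=[math.radians(q) for q in raw["joints_deg"]],
        action=list(raw["cmd"]),
        metadata={"source": "dataset_a", "embodiment": raw.get("robot", "unknown")},
    )


def from_dataset_b(raw: Dict[str, Any]) -> Step:
    # Hypothetical source B: already seconds and radians, but nested keys.
    return Step(
        timestamp_s=float(raw["time"]),
        joint_positions_rad=list(raw["state"]["q"]),
        action=list(raw["action"]),
        metadata={"source": "dataset_b", "embodiment": raw.get("platform", "unknown")},
    )


# One adapter per source; adding a dataset means adding one function here.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], Step]] = {
    "dataset_a": from_dataset_a,
    "dataset_b": from_dataset_b,
}


def normalize(source: str, raw: Dict[str, Any]) -> Step:
    """Convert a raw record from a known source into the common schema."""
    return ADAPTERS[source](raw)
```

This only covers the easy part (units and key names). Coordinate frames, action semantics, and sensor calibration are where we expect most of the real work to be.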

Before we spend months doing that, we'd love to hear from people actually building in robotics.

Where is this hypothesis wrong?

Is finding data not actually a problem?

Is embodiment mismatch the real blocker?

Is quality the issue?

Is labeling the issue?

Is everyone just collecting their own data anyway?

Would you ever use robot data collected by another team?

If I gave you access tomorrow to every public robotics dataset through one API, what would you actually do with it?

Or would you ignore it completely?

------------------------------------------------------------------------------------------------------

Also, if you're working on robotics, Physical AI, robot learning, teleoperation, simulation, data infrastructure, etc., we'd love to chat.

Happy to jump on a call, grab coffee if you're nearby, or just chat in DMs.

submitted by /u/sigma_crusader