r/LocalLLaMA

Instead of a decentralized training effort, we should build the “One dataset”

Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.

There are many threads here calling for a united LLM training run of a new open model, mainly after the government's stunt of banning commercial frontier models, and also because of the recent lack of small-to-medium open-weight model releases.

I genuinely believe that at some point we’ll have “SETI for LLM”. But not anytime soon, not this year. It requires serious primary research into training algorithms that work over high-latency network(s).

What I believe would be much more valuable is to prepare the pre-training data for such a future training run. It is much less of a “super-hard-skill” task. Clients could be invented (vibe engineered), similar to BitTorrent downloaders, that do the scraping, cleaning and hosting (sharing) of data from the Internet. A new global database with trillions of high-quality tokens, openly available and hosted on people’s computers, would be a true message from the open-source community to the billionaires stealing our data and VRAM.
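A minimal sketch of what the cleaning and sharing step of such a client might look like, assuming shards are identified by the SHA-256 of their contents (roughly how BitTorrent verifies pieces). Everything here is hypothetical and only for illustration: the names `clean`, `build_shard` and `verify_shard`, the `min_chars` threshold, and the shard layout are assumptions, and scraping and networking are left out entirely.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode and collapse whitespace runs into single spaces."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def build_shard(raw_docs: list[str], min_chars: int = 20) -> tuple[str, list[str]]:
    """Clean, filter and exact-deduplicate documents.

    Returns (shard_id, docs), where shard_id is the SHA-256 hex digest
    of the shard contents (an assumed, illustrative addressing scheme).
    """
    seen: set[str] = set()
    docs: list[str] = []
    for raw in raw_docs:
        doc = clean(raw)
        if len(doc) < min_chars:
            continue  # drop fragments too short to be useful
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after cleaning
        seen.add(digest)
        docs.append(doc)
    # Cleaned docs contain no newlines, so "\n" is an unambiguous separator.
    payload = "\n".join(docs).encode("utf-8")
    return hashlib.sha256(payload).hexdigest(), docs


def verify_shard(shard_id: str, docs: list[str]) -> bool:
    """Check a shard received from another peer against its advertised ID."""
    payload = "\n".join(docs).encode("utf-8")
    return hashlib.sha256(payload).hexdigest() == shard_id
```

A peer that downloads a shard can call `verify_shard` before re-hosting it, so tampered or corrupted data is not passed on. Real quality filtering and near-duplicate detection would need much more than this exact-match check.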

Let’s not dream about distributed LLM training on our home GPUs. We should focus on something more practical. The mere existence of such a dataset would, on its own, accelerate the development of distri-train.

submitted by /u/srigi