100 Trillion+ Pretraining data??? This is the largest data I've seen a model being trained on.
Mirrored from r/LocalLLaMA for archival readability. Support the source by reading on the original site.
Edit: This is about Minimax-M3, I just realised I didn't mention it lol

Usually we see 27-50 trillion tokens in most models: Kimi, Mimo, Deepseek. They seem to have doubled the pretraining data. Minimax-m2.5 was like 27T tokens.

For comparison, here is what Mimo and Deepseek have done:

- 27T for Mimo-v2.5-Pro (1 trillion parameters)
- 48T for the smaller Mimo-v2.5 model, which is multimodal
- 32T for Deepseek V4 Flash and Pro

I find it difficult to believe this model will be much bigger than the previous M2 series models. The training data scale is way too big: training cost grows with both token count and model size, so a much bigger model on this much data would require way more resources. M3 seems likely to be under 500B params.
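To put rough numbers on the resource argument, here is a minimal sketch using the common rule of thumb that training compute is about 6 × parameters × tokens. This is only an approximation. For MoE models the relevant count is usually the active parameters per token, and M3's parameter counts are not known here, so `ASSUMED_ACTIVE_PARAMS` is a hypothetical placeholder and not a published figure. Only the 27T and 100T token counts come from the post.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Rough training compute from the C ~ 6 * N * D rule of thumb.

    active_params: parameters used per token (assumption: active params for MoE)
    tokens: number of pretraining tokens
    """
    return 6.0 * active_params * tokens


# Token counts taken from the post.
TOKENS_M2_5 = 27e12   # ~27T tokens for Minimax-m2.5
TOKENS_M3 = 100e12    # ~100T tokens claimed for Minimax-M3

# Hypothetical placeholder, NOT a published figure for any Minimax model.
ASSUMED_ACTIVE_PARAMS = 10e9

baseline = training_flops(ASSUMED_ACTIVE_PARAMS, TOKENS_M2_5)
same_size = training_flops(ASSUMED_ACTIVE_PARAMS, TOKENS_M3)
double_size = training_flops(2 * ASSUMED_ACTIVE_PARAMS, TOKENS_M3)

# Ratios do not depend on the placeholder parameter count.
compute_ratio = same_size / baseline
compute_ratio_doubled = double_size / baseline

print(f"Same size, 100T vs 27T tokens: {compute_ratio:.2f}x compute")
print(f"Double size, 100T vs 27T tokens: {compute_ratio_doubled:.2f}x compute")
```

Under this approximation, going from 27T to 100T tokens at the same model size already costs about 3.7x the compute, and doubling the active parameters on top of that doubles the cost again. That is the reasoning behind expecting M3 to stay close to the M2 series in size.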