Developing an open source LLM from the ground up, from pretraining to RLHF (PPO/GRPO)
Hello, I have been working on building an LLM from the ground up. It is based on the DeepSeek architecture, heavily optimized to reduce the VRAM footprint (GUM + Muon).
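Muon, for context, orthogonalizes the momentum of 2D weight matrices with a few Newton-Schulz iterations before applying it as the update. A minimal sketch of that idea follows; it is not the repo's optimizer code, and it omits Muon's Nesterov variant and update scaling (the coefficients are the published quintic values):

import torch

# Hedged sketch of a Muon-style step for a 2D weight matrix: accumulate momentum,
# orthogonalize it via Newton-Schulz, apply it as the update. Not the repo's code.
@torch.no_grad()
def newton_schulz5(G, steps=5, eps=1e-7):
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic coefficients from the Muon write-up
    X = G.to(torch.bfloat16)
    X = X / (X.norm() + eps)                   # bring the spectral norm to <= ~1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return (X.T if transposed else X).to(G.dtype)

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)         # classic momentum accumulation
    param.add_(newton_schulz5(momentum_buf), alpha=-lr)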
Below is the JSON config I am currently using, which should make clear what is being pretrained.
Training on a single RTX pro 6000 Blackwell!!!!
Testing a 7B parameter model with 64 experts, currently running on a single GPU at 100% throughput (the hardest part), at roughly 80 GB of VRAM during training. Reducing the expert count will substantially reduce the VRAM footprint; I am just pushing the limits here!
My main motivation here is simply the belief that open source development will far outpace big-firm development. I believe there is someone out there who can use this to build an LLM from the ground up that beats the top 1T-parameter models. My goal is to create a large database of trained models that anyone can use, and maybe in the future let people rent models from open source devs as a support feature. Enough blabbing, here is the technical report.
Since I am using DOLMA/RedPajama, you can adjust the data split and train separate models to be good at math, literature, physics, and so on, and then deploy them as an ensemble of agents (this is a TODO for now, since I don't have a single baseline model to compare against).
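As an illustration, here is a hypothetical reweighting of the same Dolma sources for a code-leaning specialist; the weights are made up, and only the two subsets that appear in the config below are assumed to exist:

# Hypothetical per-domain reweighting of the "data.sources" list for a
# code-leaning specialist run; weights are illustrative, not tested.
code_specialist_sources = [
    {"name": "stack",     "type": "dolma", "subset": "dolma_v1_6_stack",     "weight": 0.70},
    {"name": "redpajama", "type": "dolma", "subset": "dolma_v1_6_redpajama", "weight": 0.30},
]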
This also follows the Chinchilla-optimal token budget. Thanks, DeepMind!
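Concretely: with the 7B parameter target and the tokens_per_parameter_ratio of 40 set in the config below, the budget is 7e9 × 40 = 280B training tokens, which is exactly the configured total_training_tokens.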
Everything is bfloat16; it can be configured to use fp16 or fp32 if you are from the future and have a GPU that can do fp32 at bf16 speed!
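A minimal sketch of what the bf16 setting amounts to in PyTorch (illustrative only; model, loss_fn, input_ids, and labels are placeholders, and the real trainer wires this up differently):

import torch

# bf16 autocast around the forward pass, matching "autocast_dtype": "bfloat16".
# Unlike fp16, bf16 generally needs no GradScaler.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(input_ids)
    loss = loss_fn(logits, labels)
loss.backward()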
Yes I have lost my mind many times during this, but I got something working!
This is about 15,000 steps in:
======================================================================
[FACTUAL ACCURACY TEST] Step 14000
Prompt: "The capital of France is"
Output: "the city of Nice.
France may also refer to:
France (surname)
France (surname)
France (or Republ..." [CORRECT]
Prompt: "The capital of Japan is"
Output: "the capital of the autonomous prefecture of Hokkaido.
Etymology
The name of Hokkaido is derived fro..." [EXPECTED: Tokyo]
Prompt: "def fibonacci(n):
"""Return the nth Fibonacci ..."
Output: """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""..."
Prompt: "import torch
import torch.nn as nn
class Transfor..."
Output: "// InverterBlock
// s2, s2, s3
// A_1, A_2, A_3, A_4, A_5, A_6
// A1, A2, A3, A4, A5
// A1, A2, A3, ..."
Prompt: "The theory of relativity states that"
Output: "the speed of light varies with the speed of the observer. This is a constant, since the speed of lig..."
Prompt: "In machine learning, gradient descent is used to"
Output: "perform a gradient descent, where the gradient is calculated via a local gradient. The gradient eval..."
Prompt: "Question: What is 2 + 2?
Answer:"
Output: "2 + 2
Author: PCR
Date Submitted: 2nd April 2013
Pp: 200-201
Exercise: Exercise 2.0
2 + 1 = 2 +..." [EXPECTED: 4]
Prompt: "Question: Explain the concept of recursion.
Answer..."
Output: "In programming, a function or sequence of operations is a function that can transform a variable to ..."
FACTUAL ACCURACY: 1/3 = 33.3%
----------------------------------------------------------------------
[SMBench] Step 14000 -- 1/5: Multi-Rule Reasoning
.
.
.
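For reference, a probe like the factual-accuracy test logged above can be as simple as a substring check over greedy generations. The sketch below is illustrative, not the repo's harness; it assumes an HF-style tokenizer/generate interface, which the custom model here may not expose, and the prompt/answer pairs are placeholders:

import torch

# Illustrative substring-based factual-accuracy probe; model/tokenizer API and
# the prompt/expected pairs are assumptions, not the actual evaluation code.
FACT_PROBES = [
    ("The capital of France is", "Paris"),
    ("The capital of Japan is", "Tokyo"),
    ("Question: What is 2 + 2?\nAnswer:", "4"),
]

@torch.no_grad()
def factual_accuracy(model, tokenizer, max_new_tokens=32, device="cuda"):
    hits = 0
    for prompt, expected in FACT_PROBES:
        ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        completion = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        hits += int(expected.lower() in completion.lower())
    return hits / len(FACT_PROBES)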
JSON config defining the architecture (abridged):
"experiment_name": "deepseek_v3_7b_lowvram",
"output_dir": "*******",
"seed": 420,
"model": {
"num_layers": 24,
"vocab_size": 50304,
"norm_type": "rmsnorm",
"norm_eps": 1e-06,
"tie_word_embeddings": false,
"init_method_std": 0.006,
"first_k_dense_replace": 8,
"dense_layer_interval": 1,
"paper_compliant": false,
"mla": {
"d_model": 1408,
"d_latent": 352,
"num_heads": 22,
"num_kv_heads": 2,
"max_context_length": 4096,
"use_flash_mla": false,
.
.
.
},
"moe": {
"num_experts": 64,
"num_experts_per_token": 4,
"expert_intermediate_size": 1536,
"expert_dim": 1536,
"dropout": 0.0,
"num_shared_experts": 1,
.
.
.
.
}
},
"fusions": {
"use_fused_expert_ffn": true,
"use_te_fused_topk": false,
"use_te_fused_permute": false,
"use_fused_softmax": true,
"fused_softmax_in_fp32": true,
"use_group_limited_topk": true,
.
.
.
},
"memory_optimization": {
"use_galore": false,
"galore_rank": 256,
"galore_update_proj_gap": 500,
"galore_scale": 1.0,
.
.
.
},
"training": {
"device": "cuda",
"global_batch_size": 256,
"micro_batch_size": 4,
"gradient_accumulation_steps": 64,
"seq_length": 1024,
"max_batch_seq_multiplier": 1.25,
"tokens_per_parameter_ratio": 40.0,
"total_training_tokens": 280000000000,
"learning_rate": 0.00042,
"min_learning_rate": 4.2e-05,
"lr_preset": "deepseek_v3",
.
.
.
},
"data": {
"use_multi_source": true,
"sources": [
{
"name": "redpajama",
"type": "dolma",
"subset": "dolma_v1_6_redpajama",
"weight": 0.45,
"description": "RedPajama - CommonCrawl-like diverse web/code/books"
},
{
"name": "stack",
"type": "dolma",
"subset": "dolma_v1_6_stack",
"weight": 0.25,
.
.
.
],
"cache_dir": "*******",
"sanitization": {
"enabled": true,
"target_language": "en",
"min_language_confidence": 0.9,
"min_article_length": 100,
.
.
.
},
"preprocessing": {
"num_workers": 8,
"shuffle": true,
"shuffle_seed": 42,
.
.
.
},
"max_articles": null,
"focus_historical": false,
"boost_hiroshima_content": false
},
"distributed": {
"backend": "nccl",
"launcher": "single_gpu",
"tensor_parallel_size": 1,
"pipeline_parallel_size": 1,
"expert_parallel_size": 1,
"data_parallel_size": 1,
"zero_stage": 2,
"zero_offload": true,
"overlap_grad_reduce": true,
"overlap_param_gather": true,
"deepspeed": {
"enabled": false
}
},
"checkpointing": {
"save_interval": 1000,
"save_total_limit": 3,
"resume_from_checkpoint": null,
"checkpoint_format": "pytorch",
"save_optimizer_states": true
},
"logging": {
"log_level": "INFO",
"log_interval": 100,
"tensorboard_dir": "*******",
"wandb": {
"enabled": false
},
"tensorboard": {
"enabled": true
}
},
"validation": {
"enabled": true,
"eval_interval": 1000,
"eval_samples": 500,
"metrics": [
"loss",
"perplexity"
],
"patience": 300,
"early_stopping": false
},
"profiling": {
"trace_nvtx": false
},
"gpu_optimization": {
"cuda_graphs": true,
"torch_compile": true,
"flash_attention": true,
"fused_kernels": true,
"autocast_dtype": "bfloat16"
},
"test_prompts": {
"enabled": true,
So I basically researched and threw in every optimization on planet Earth. I even tried to build my own FlashMLA for the sm120 Blackwell arch and failed miserably; I got inference working, but I couldn't get the backward pass due to the tiling, which ends up being the same as, if not worse than, the ATen torch backend.
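For context, two hedged sketches to ground the config above. First, the latent KV compression that FlashMLA accelerates, using the sizes from the mla block (not the repo's implementation; it ignores RoPE, the query-side path, and the num_kv_heads setting, and assumes head_dim = d_model / num_heads = 64):

import torch
import torch.nn as nn

# Sketch of MLA-style KV compression: cache a small latent per token (d_latent = 352
# values here) instead of full keys+values (2 * 22 * 64 = 2816 values per token).
class MLAKVCompression(nn.Module):
    def __init__(self, d_model=1408, d_latent=352, num_heads=22, head_dim=64):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)   # output is what gets cached
        self.up_k = nn.Linear(d_latent, num_heads * head_dim, bias=False)
        self.up_v = nn.Linear(d_latent, num_heads * head_dim, bias=False)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        b, s, _ = x.shape
        latent = self.down_kv(x)                              # the compressed KV cache
        k = self.up_k(latent).view(b, s, self.num_heads, self.head_dim)
        v = self.up_v(latent).view(b, s, self.num_heads, self.head_dim)
        return k, v, latent

Second, the top-k routing implied by the moe block (64 routed experts, 4 active per token, 1 always-on shared expert). Plain MLPs stand in for the real expert FFNs, and this ignores the group-limited top-k and the fused kernels flagged in the config:

import torch
import torch.nn as nn

# Naive top-k MoE routing for illustration only; the actual training code uses
# fused expert FFN / top-k kernels per the "fusions" block.
class TopKMoE(nn.Module):
    def __init__(self, d_model=1408, d_ff=1536, num_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # "num_shared_experts": 1 -- one always-on expert every token passes through.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                                     # x: (num_tokens, d_model)
        gate = self.router(x).softmax(dim=-1)                 # (num_tokens, num_experts)
        weights, idx = torch.topk(gate, self.top_k, dim=-1)   # 4 routed experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = self.shared_expert(x)
        routed = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            tok, slot = (idx == e).nonzero(as_tuple=True)     # tokens routed to expert e
            if tok.numel():
                routed[tok] += weights[tok, slot, None] * expert(x[tok])
        return out + routed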
But this is working for now, at roughly 20 seconds a step, e.g.:
Training: 1%|█ | 14609/1000000 [53:18:23<5533:28:53, 23.37s/step, loss=2.1507, mtp=1.9643, ent=4.12, util=100.0%, imbal=0.26, lr=4.20e-04, tok=2.23B]
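For scale: the configured global_batch_size of 256 and seq_length of 1024 mean each optimizer step covers up to 256 × 1024 ≈ 262K tokens, so the 280B-token budget works out to on the order of a million steps, consistent with the 1,000,000-step progress bar above.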
So, in conclusion: I am scared as shit to open source this until I get it working 100%, so as to minimize the community hate I will eventually get.
My only point of contention is that I want all models trained with this to be public; I don't want anyone privatizing it for profit without open-sourcing. So I need to ask around and figure out how to go about that, since I want as many models as possible trained with it. I believe there is someone out there with the right configuration already in mind who will beat the top performing model. This is mainly why I did this: I know I can't create THAT model, but I know for sure as shit there is some genius out there who can train a model that will be SOTA.
There is a lot of cleaning up to do before I make it public, because I am scared of the hate and of issues I surely cannot fix alone!
If you are interested, you can check my account periodically for when I post about making the repo public, or check my GitHub, which would be easier, I assume, lol:
https://github.com/IISuperluminaLII
I don't know... I am open to feedback on how to properly make this public and how to enforce a strict rule that all safetensors or checkpoints produced with this code must be open sourced. I know there is someone out there who, given the right tools, can truly build an ensemble of 10B-50B parameter models that achieves near-SOTA performance! As they always say, divide and conquer.
This is getting long already; I have puked my brains out as much as I can. Any input is welcome, even hate! Let me know how to fix this so I can deliver the tool to the random person who will eventually create the perfect open source model.