Training GPT-like model on non-language series [R]
Mirrored from r/MachineLearning for archival readability. Support the source by reading on the original site.
I am running a research project to train GPT-like (Transformer decoder) models in three sizes: 100M, 250M and 500M parameters.
# params
## training dataset
- 750M tokens
- vocabulary is ~15k to ~100k tokens (depends on tokenizer settings)
- ~3% of the vocabulary accounts for ~50% of the training tokens (similar to language, where most of the vocabulary is used very sparsely)
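The skew statistic above (a small share of the vocabulary covering half of the tokens) can be measured with a few lines of Python. This is only a sketch: `vocab_fraction_for_coverage` is a hypothetical helper name, and the toy token list stands in for the real tokenized corpus, which is not shown in the post.

```python
from collections import Counter

def vocab_fraction_for_coverage(token_ids, vocab_size, coverage=0.5):
    """Fraction of the vocabulary needed to cover `coverage` of all tokens,
    counting the most frequent token types first."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    covered = 0
    used = 0
    for _, count in counts.most_common():
        covered += count
        used += 1
        if covered >= coverage * total:
            break
    return used / vocab_size

# Toy corpus (an assumption, not the real data): token 0 alone makes up half of it.
toy_tokens = [0] * 50 + [1] * 30 + [2] * 10 + list(range(3, 13))
fraction = vocab_fraction_for_coverage(toy_tokens, vocab_size=100)
print(f"{fraction:.1%} of the vocabulary covers 50% of the tokens")
```

Running the same function on the real token stream for each tokenizer setting would show how the skew changes between the ~15k and ~100k vocabularies.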
## training hyper-params
- optimizer = AdamW
- lr = 1e-3 (works best compared to 1e-2 and 1e-4)
- betas = [0.9, 0.95]
- effective batch size = 4M tokens
- epochs = 16
- warmup steps ~200 (approx 1 epoch)
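The step counts implied by the numbers above can be checked with some quick arithmetic. The sketch below uses only the figures from the post; the shape of the schedule is an assumption (linear warmup, then a constant rate), since the post does not say what happens to the learning rate after warmup.

```python
import math

# Figures taken from the lists above.
dataset_tokens = 750_000_000
batch_tokens = 4_000_000
epochs = 16
warmup_steps = 200
peak_lr = 1e-3

steps_per_epoch = dataset_tokens / batch_tokens
total_steps = math.ceil(steps_per_epoch * epochs)
tokens_processed = total_steps * batch_tokens

def lr_at(step):
    """Learning rate at a 0-indexed optimizer step.

    Assumed schedule: linear warmup to peak_lr, then constant.
    The post does not state the decay schedule.
    """
    return peak_lr * min(1.0, (step + 1) / warmup_steps)

print(f"steps per epoch:  {steps_per_epoch}")
print(f"total steps:      {total_steps}")
print(f"tokens processed: {tokens_processed:,}")
print(f"warmup share:     {warmup_steps / total_steps:.1%}")
print(f"lr at step 99:    {lr_at(99):.2e}")
```

With a 4M-token batch, one pass over 750M tokens is fewer than 200 optimizer steps, which is why ~200 warmup steps corresponds to roughly one epoch.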
## model hyper-params
- 16 layers (but variants with up to 48 layers were tested)
- embedding size (n_embd) = varied to yield the 100M, 250M and 500M models
- MLP size = 4*n_embd
- 16 attention heads
- context window = 1000
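To see which embedding sizes roughly match the three target sizes, a common back-of-the-envelope count for a decoder block is `12 * n_embd**2` weights (4 for the attention projections, 8 for an MLP with hidden size `4*n_embd`). The sketch below rests on several assumptions that are not in the post: tied input/output embeddings, learned positional embeddings, biases and LayerNorm weights ignored, and the low-end vocabulary of 15k. `approx_params` and `embd_for_target` are hypothetical helper names.

```python
def approx_params(n_layer, n_embd, vocab_size, n_ctx, tied=True):
    """Rough parameter count for a GPT-style decoder (biases and LayerNorm ignored)."""
    blocks = 12 * n_layer * n_embd ** 2            # attention (4*d^2) + MLP (8*d^2) per layer
    token_emb = vocab_size * n_embd * (1 if tied else 2)
    pos_emb = n_ctx * n_embd                       # assumes learned positional embeddings
    return blocks + token_emb + pos_emb

def embd_for_target(target, n_layer=16, n_head=16, vocab_size=15_000, n_ctx=1000):
    """Find the n_embd (a multiple of n_head) whose estimate is closest to `target`."""
    best_embd, best_params = None, None
    n_embd = n_head
    while True:
        params = approx_params(n_layer, n_embd, vocab_size, n_ctx)
        if best_params is None or abs(params - target) < abs(best_params - target):
            best_embd, best_params = n_embd, params
        if params > target:
            break
        n_embd += n_head
    return best_embd, best_params

for target in (100_000_000, 250_000_000, 500_000_000):
    n_embd, params = embd_for_target(target)
    print(f"target {target / 1e6:.0f}M -> n_embd={n_embd}, approx {params / 1e6:.1f}M params")
```

At the ~100k vocabulary setting the embedding table takes a much larger share of the budget, so the same targets would need a smaller `n_embd`; passing `vocab_size=100_000` shows the difference.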
# Issue
The model seems to fail to learn basic auto-regressive behavior. With greedy decoding (no repetition penalty, no sampling yet), it often gets stuck generating a single token over and over.
Is training GPT-like models still black magic? Is there some trick to this?
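For reference, decoding without sampling or a repetition penalty amounts to taking the argmax of the logits at every step. The sketch below illustrates that loop and a simple way to quantify the collapse. Everything in it is illustrative: `toy_logits` is a stand-in that ignores its context, not the model from the post, and `dominant_token_share` is a hypothetical helper.

```python
from collections import Counter

def greedy_decode(next_logits, prompt, max_new_tokens):
    """Greedy decoding: append the argmax token at every step.

    `next_logits` is a hypothetical callable mapping a list of token ids
    to a list of logits over the vocabulary.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = next_logits(tokens)
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens

def dominant_token_share(tokens):
    """Share of the sequence taken up by its most frequent token."""
    return Counter(tokens).most_common(1)[0][1] / len(tokens)

def toy_logits(tokens):
    # Stand-in model that ignores context and always prefers token 2,
    # which mimics a model that has only learned token frequencies.
    return [0.1, 0.3, 2.0, -1.0]

prompt = [0, 1]
output = greedy_decode(toy_logits, prompt, max_new_tokens=8)
generated = output[len(prompt):]
print(generated, dominant_token_share(generated))
```

Tracking this share on held-out prompts across checkpoints is one way to tell whether the collapse is improving during training or present from the start.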
*Disclaimer*: I will add/edit the parameters above as people ask clarifying questions.