Efficient Pre-Training with Token Superposition
Abstract
Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications to achieve high data throughput. In this work, we present Token-Superposition Training (TST), a simple drop-in method that significantly improves data throughput per FLOP during pre-training without modifying the parallelism, optimizer, tokenizer, data, or model architecture. TST proceeds in two phases: (i) a highly efficient superposition phase, in which many contiguous tokens are combined into one bag and trained using a multi-hot cross-entropy (MCE) objective, and (ii) a recovery phase, in which training reverts to the standard objective. We extensively evaluate TST at the 270M and 600M parameter scales and validate it at 3B and on a 10B A1B mixture-of-experts model, demonstrating that it is highly robust across settings. Ultimately, TST consistently outperforms the baseline in both loss and downstream evaluations, and under equal-loss settings it yields up to a 2.5x reduction in total pre-training time at the 10B A1B scale.
AI-generated summary
Token-Superposition Training (TST) improves pre-training efficiency by combining contiguous tokens into bags during a superposition phase trained with a multi-hot cross-entropy objective, achieving faster training without architectural changes.
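As a concrete illustration of the superposition phase, the PyTorch sketch below averages every k contiguous token embeddings into one input vector and builds the multi-hot target for the next bag. The function name, shapes, and fixed bag size k are assumptions made for illustration, not the authors' implementation; the paper defines the exact formulation.

```python
import torch
import torch.nn as nn

def superpose_tokens(token_ids, embedding, k, vocab_size):
    """Collapse every k contiguous tokens into one 'superposed' input
    embedding and build the multi-hot target for the next bag of tokens.

    token_ids: LongTensor of shape (seq_len,), seq_len divisible by k.
    embedding: nn.Embedding mapping token ids to d-dimensional vectors.
    """
    bags = token_ids.view(-1, k)                # (num_bags, k)
    superposed = embedding(bags).mean(dim=1)    # (num_bags, d): average of each bag
    # Multi-hot targets: 1.0 at every vocabulary id that appears in a bag.
    targets = torch.zeros(bags.size(0), vocab_size)
    targets.scatter_(1, bags, 1.0)
    # Position i is trained to predict the token bag at position i + 1.
    return superposed[:-1], targets[1:]

# Toy usage:
vocab_size, d, k = 100, 16, 4
emb = nn.Embedding(vocab_size, d)
ids = torch.randint(0, vocab_size, (32,))
inputs, labels = superpose_tokens(ids, emb, k, vocab_size)
print(inputs.shape, labels.shape)  # torch.Size([7, 16]) torch.Size([7, 100])
```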
Community
Token-Superposition Training (TST) is a simple two-phase pre-training method that improves data throughput per FLOP without modifying the model architecture, optimizer, or tokenizer. In the first phase, contiguous tokens are averaged into "superposed" embeddings and the model is trained with a multi-hot cross-entropy loss that predicts the next bag of tokens; in the second phase, training reverts to standard next-token prediction. At the 10B A1B MoE scale (10B total parameters, 1B active), TST reaches the same loss as the baseline in 2.5× less pre-training time, while also improving downstream performance on benchmarks such as HellaSwag, ARC, and MMLU.
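The multi-hot cross-entropy loss is the less familiar ingredient, so here is one minimal PyTorch sketch of what it could look like. The function name is hypothetical, and normalizing the bag to a uniform soft target is my assumption; the paper defines the exact MCE objective.

```python
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits, multi_hot):
    """Cross-entropy against a bag of target tokens.

    logits:    (num_bags, vocab) model outputs at each superposed position.
    multi_hot: (num_bags, vocab) 1.0 where a token occurs in the next bag.
    """
    # Treat the bag as a uniform soft target over its member tokens,
    # then apply the standard soft-label cross-entropy.
    target = multi_hot / multi_hot.sum(dim=-1, keepdim=True)
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```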