NITP: Next Implicit Token Prediction for LLM Pre-training

Accepted to ICML 2026.
arxiv:2605.24956

Published on May 24, 2026 · Submitted by Xiangdong Zhang on Jun 2, 2026
Authors: Xiangdong Zhang, Debing Zhang, Shaofeng Zhang, Xiaohan Qin, Yu Cheng, Junchi Yan

Abstract

AI-generated summary

Next Implicit Token Prediction enhances language model training by adding dense continuous supervision in representation space, improving generalization and performance across model sizes with minimal computational overhead.

Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
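
To make the idea of combining discrete and continuous supervision concrete, here is a minimal sketch of a per-position loss in plain Python. It is not the authors' implementation: the abstract does not specify the distance function, the loss weight, or how the prediction is produced, so the cosine distance, the `aux_weight` parameter, and all function names below are assumptions chosen for illustration. In a real training setup one would presumably use an autodiff framework and treat the shallow-layer target as a constant (stop-gradient), which is also an assumption here.

```python
import math


def softmax_cross_entropy(logits, target_index):
    """Standard NTP loss: negative log-probability of the true next token."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def cosine_distance(pred, target):
    """1 - cosine similarity; close to 0 when the vectors point the same way."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t + 1e-8)


def nitp_style_loss(logits, next_token_id, predicted_repr, shallow_target_repr,
                    aux_weight=0.1):
    """Hypothetical combined loss for one position.

    logits              -- output logits at position t (discrete supervision)
    next_token_id       -- index of the true token at position t + 1
    predicted_repr      -- the model's continuous prediction for token t + 1
    shallow_target_repr -- shallow-layer representation of token t + 1,
                           assumed to be treated as a fixed target
    aux_weight          -- assumed weighting of the continuous term
    """
    ntp = softmax_cross_entropy(logits, next_token_id)
    aux = cosine_distance(predicted_repr, shallow_target_repr)
    return ntp + aux_weight * aux, ntp, aux


# Usage: the discrete term is identical in both calls; only the
# representation-space term changes with the quality of the prediction.
aligned_total, aligned_ntp, aligned_aux = nitp_style_loss(
    [2.0, 0.5, -1.0], 0, [1.0, 0.0], [1.0, 0.0])
orthogonal_total, orthogonal_ntp, orthogonal_aux = nitp_style_loss(
    [2.0, 0.5, -1.0], 0, [1.0, 0.0], [0.0, 1.0])
```

Because the extra term in this sketch only touches training-time targets, dropping it at inference leaves the forward pass for generation unchanged, which is consistent with the abstract's claim of no additional inference cost.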

Community

Paper submitter about 8 hours ago

Accepted to ICML 2026.
