Long Context Pre-Training with Lighthouse Attention
Abstract

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training. Our hierarchical selection is also gradient-free, which exempts us from writing a complicated and potentially inefficient backward-pass kernel. Our contribution is three-fold: (i) a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence; (ii) a symmetrical compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; and (iii) a two-stage training approach in which we pre-train with Lighthouse Attention for the majority of the run and recover a full-attention model at the end with a short training phase. We run preliminary small-scale LLM pre-training experiments that show the effectiveness of our method against full-attention training with all other settings matched: we achieve a faster total training time and a lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention

AI-generated summary

Lighthouse Attention enables efficient training of causal transformers at long sequences by using hierarchical selection-based attention that reduces computational complexity while maintaining model performance.
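The abstract does not spell out the compression operator, so below is a minimal, hypothetical PyTorch sketch of the symmetrical-compression idea: fixed average pooling stands in for the paper's adaptive, gradient-free hierarchical selection, and the block-level causal mask is only a coarse approximation of the exact left-to-right causality the method preserves. The function name `pooled_sdpa` and the `pool` factor are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def pooled_sdpa(q, k, v, pool: int = 4):
    """q, k, v: (batch, heads, seq, dim); seq is assumed divisible by `pool`.

    Illustrative stand-in for Lighthouse Attention: fixed average pooling
    replaces the paper's adaptive, gradient-free hierarchical selection.
    """
    b, h, s, d = q.shape
    # Symmetrical compression: queries, keys, and values are pooled together,
    # shrinking the attention matrix from (s x s) to (s/pool x s/pool).
    q_c = q.reshape(b, h, s // pool, pool, d).mean(dim=3)
    k_c = k.reshape(b, h, s // pool, pool, d).mean(dim=3)
    v_c = v.reshape(b, h, s // pool, pool, d).mean(dim=3)
    # Ordinary SDPA on the compressed sequence; is_causal gives a block-level
    # causal mask, a simplification of the exact causality in the paper.
    out_c = F.scaled_dot_product_attention(q_c, k_c, v_c, is_causal=True)
    # Decompression: broadcast each block's output back to its `pool` tokens.
    return out_c.repeat_interleave(pool, dim=2)

q = k = v = torch.randn(1, 8, 1024, 64)
out = pooled_sdpa(q, k, v, pool=4)  # out: (1, 8, 1024, 64)
```

Because the wrapper only pre- and post-processes around a standard SDPA call, deleting the compression and decompression steps leaves an ordinary full-attention layer, which is what makes the method training-only and removable.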
Community
We propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training.
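To make the "easily removed" property concrete, here is a hypothetical toggle built on the `pooled_sdpa` sketch above; the `SwitchableAttention` class and its `use_lighthouse` flag are assumptions for illustration, not part of the released code.

```python
import torch
import torch.nn.functional as F

class SwitchableAttention(torch.nn.Module):
    """Hypothetical wrapper: pooled SDPA (the `pooled_sdpa` sketch above)
    during the long pre-training phase, plain causal SDPA once the flag is
    flipped off for the short recovery phase."""

    def __init__(self, pool: int = 4):
        super().__init__()
        self.pool = pool
        self.use_lighthouse = True  # set to False for the recovery phase

    def forward(self, q, k, v):
        if self.use_lighthouse:
            return pooled_sdpa(q, k, v, self.pool)
        # With the wrapper removed, this is an ordinary full-attention layer.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Flipping the flag off for the short recovery phase leaves nothing but the standard SDPA call, so the final checkpoint is a plain full-attention model.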