Long Context Pre-Training with Lighthouse Attention
Abstract

Training causal transformers at extreme sequence lengths is bottlenecked by the quadratic time and memory of scaled dot-product attention (SDPA). In this work, we propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training. Our hierarchical selection is also gradient-free, which exempts us from writing a complicated and potentially inefficient backward-pass kernel. Our contribution is three-fold: (i) a subquadratic hierarchical pre- and post-processing step that adaptively compresses and decompresses the sequence; (ii) a symmetrical compression strategy that pools queries, keys, and values at the same time while preserving left-to-right causality, which greatly improves parallelism; and (iii) a two-stage training approach in which we pre-train with Lighthouse Attention for the majority of the run and recover a full-attention model at the end with a short training phase. We run preliminary small-scale LLM pre-training experiments that show the effectiveness of our method against full-attention training with all other settings matched: we achieve a faster total training time and a lower final loss after the recovery phase. Full code is available at: https://github.com/ighoshsubho/lighthouse-attention

AI-generated summary

Lighthouse Attention enables efficient training of causal transformers at long sequences by using hierarchical selection-based attention that reduces computational complexity while maintaining model performance.
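The abstract does not spell out the compression operator, so below is a minimal, hypothetical PyTorch sketch of the symmetrical-compression idea: fixed average pooling stands in for the paper's adaptive, gradient-free hierarchical selection, and the block-level causal mask is only a coarse approximation of the exact left-to-right causality the method preserves. The function name `pooled_sdpa` and the `pool` factor are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def pooled_sdpa(q, k, v, pool: int = 4):
    """q, k, v: (batch, heads, seq, dim); seq is assumed divisible by `pool`.

    Illustrative stand-in for Lighthouse Attention: fixed average pooling
    replaces the paper's adaptive, gradient-free hierarchical selection.
    """
    b, h, s, d = q.shape
    # Symmetrical compression: queries, keys, and values are pooled together,
    # shrinking the attention matrix from (s x s) to (s/pool x s/pool).
    q_c = q.reshape(b, h, s // pool, pool, d).mean(dim=3)
    k_c = k.reshape(b, h, s // pool, pool, d).mean(dim=3)
    v_c = v.reshape(b, h, s // pool, pool, d).mean(dim=3)
    # Ordinary SDPA on the compressed sequence; is_causal gives a block-level
    # causal mask, a simplification of the exact causality in the paper.
    out_c = F.scaled_dot_product_attention(q_c, k_c, v_c, is_causal=True)
    # Decompression: broadcast each block's output back to its `pool` tokens.
    return out_c.repeat_interleave(pool, dim=2)

q = k = v = torch.randn(1, 8, 1024, 64)
out = pooled_sdpa(q, k, v, pool=4)  # out: (1, 8, 1024, 64)
```

Because the wrapper only pre- and post-processes around a standard SDPA call, deleting the compression and decompression steps leaves an ordinary full-attention layer, which is what makes the method training-only and removable.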
Community
We propose Lighthouse Attention, a training-only, symmetrical, selection-based hierarchical attention algorithm that wraps around ordinary SDPA and can be easily removed toward the end of training.
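To make the "easily removed" property concrete, here is a hypothetical toggle built on the `pooled_sdpa` sketch above; the `SwitchableAttention` class and its `use_lighthouse` flag are assumptions for illustration, not part of the released code.

```python
import torch
import torch.nn.functional as F

class SwitchableAttention(torch.nn.Module):
    """Hypothetical wrapper: pooled SDPA (the `pooled_sdpa` sketch above)
    during the long pre-training phase, plain causal SDPA once the flag is
    flipped off for the short recovery phase."""

    def __init__(self, pool: int = 4):
        super().__init__()
        self.pool = pool
        self.use_lighthouse = True  # set to False for the recovery phase

    def forward(self, q, k, v):
        if self.use_lighthouse:
            return pooled_sdpa(q, k, v, self.pool)
        # With the wrapper removed, this is an ordinary full-attention layer.
        return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Flipping the flag off for the short recovery phase leaves nothing but the standard SDPA call, so the final checkpoint is a plain full-attention model.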