Hugging Face Daily Papers · 5 min read

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Mirrored from Hugging Face Daily Papers for archival readability. Support the source by reading on the original site.

arxiv:2605.09932

FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

Published on May 11 · Submitted by JarvisPei on May 13
Authors: Zehua Pei, Hui-Ling Zhen, Xianzhi Yu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

AI-generated summary

FocuSFT, a training framework, improves long-context language model performance by correcting how attention is allocated: bilevel optimization builds a parametric memory that focuses attention on semantically relevant content.

Abstract

Large language models can now process increasingly long inputs, yet their ability to effectively use information spread across long contexts remains limited. We trace this gap to how the attention budget is spent during supervised fine-tuning (SFT) on long sequences: positional biases and attention sinks cause the model to allocate most of its attention to positionally privileged tokens rather than semantically relevant content. This training-time attention dilution (the starvation of content tokens in the attention distribution) weakens the gradient signal, limiting the model's ability to learn robust long-context capabilities. We introduce FocuSFT, a bilevel optimization framework that addresses this problem at training time. An inner loop adapts lightweight fast-weight parameters on the training context to form a parametric memory that concentrates attention on relevant content, and the outer loop performs SFT conditioned on this sharpened representation. Both loops apply bidirectional attention over context tokens while preserving causal masking for responses, reducing the causal asymmetry that gives rise to attention sinks and aligning inner and outer behavior. On BABILong, FocuSFT improves accuracy by up to 14 percentage points across 4K–32K context lengths; on RULER, it raises CWE aggregation from 72.9% to 81.1% at 16K; and on GPQA with agentic tool use, it yields a 24% relative gain in pass@1. Attention analysis shows that FocuSFT reduces attention sink mass by 529× and triples context engagement during training. Code: https://github.com/JarvisPei/FocuSFT
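
To make the recipe concrete, below is a minimal, self-contained PyTorch sketch of the two ingredients the abstract describes: a hybrid attention mask that is bidirectional over the context but causal over the response, and a bilevel step that first adapts low-rank fast weights on the context, then takes an ordinary SFT step conditioned on them. Everything here is illustrative rather than the paper's implementation: the toy one-layer model, the value-stream adapter, the masked-reconstruction inner objective, and the first-order treatment of the inner updates are all assumptions; see the released code at the GitHub link above for the actual method.

import torch
import torch.nn as nn
import torch.nn.functional as F

def build_hybrid_mask(ctx_len: int, resp_len: int) -> torch.Tensor:
    # Boolean attention mask (True = may attend). Context rows attend
    # bidirectionally within the context; response rows stay strictly causal.
    total = ctx_len + resp_len
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))  # causal base
    mask[:ctx_len, :ctx_len] = True  # open the context block to full bidirectionality
    return mask

class ToyDecoder(nn.Module):
    # Single-attention-layer LM. `fast` is a low-rank adapter on the value
    # stream, standing in for the paper's fast-weight parametric memory.
    def __init__(self, vocab: int = 256, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.qkv = nn.Linear(d, 3 * d)
        self.head = nn.Linear(d, vocab)

    def forward(self, ids, attn_mask, fast=None):
        x = self.emb(ids)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if fast is not None:  # fast-weight memory: v <- v + (v A) B
            a, b = fast
            v = v + (v @ a) @ b
        scores = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
        scores = scores.masked_fill(~attn_mask, float("-inf"))
        return self.head(scores.softmax(dim=-1) @ v)

def focusft_step(model, ctx, resp, rank=4, inner_steps=3, inner_lr=1e-2):
    d = model.emb.embedding_dim
    mask = build_hybrid_mask(ctx.size(1), resp.size(1))

    # Inner loop: fit fresh fast weights on this context alone so the
    # representation concentrates on its content. The objective is a stand-in
    # (reconstruct randomly masked context tokens under bidirectional
    # attention); the paper's inner objective may differ.
    a = torch.zeros(d, rank, requires_grad=True)        # zero init: adapter starts inert
    b = (0.01 * torch.randn(rank, d)).requires_grad_()
    ctx_mask = mask[: ctx.size(1), : ctx.size(1)]
    for _ in range(inner_steps):
        drop = torch.rand_like(ctx, dtype=torch.float) < 0.15
        if not drop.any():
            continue
        corrupt = ctx.masked_fill(drop, 0)              # 0 = stand-in [MASK] id
        logits = model(corrupt, ctx_mask, fast=[a, b])
        inner_loss = F.cross_entropy(logits[drop], ctx[drop])
        ga, gb = torch.autograd.grad(inner_loss, [a, b])
        with torch.no_grad():                           # plain SGD on the fast weights
            a -= inner_lr * ga
            b -= inner_lr * gb

    # Outer loop: ordinary SFT loss on the response, conditioned on the
    # adapted memory. First-order shortcut: the fast weights are detached
    # rather than differentiated through the inner updates.
    logits = model(torch.cat([ctx, resp], dim=1), mask, fast=[a.detach(), b.detach()])
    pred = logits[:, ctx.size(1) - 1 : -1]              # positions that predict resp
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), resp.reshape(-1))

A training step then wraps this in a normal optimizer update over the slow (model) weights:

model = ToyDecoder()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
ctx = torch.randint(1, 256, (2, 128))   # batch of 2, 128-token contexts
resp = torch.randint(1, 256, (2, 32))   # 32-token responses
loss = focusft_step(model, ctx, resp)
opt.zero_grad()
loss.backward()
opt.step()

Note how the hybrid mask does double duty: the same bidirectional context block is used in both loops, which is what the abstract means by aligning inner and outer behavior while keeping response prediction strictly causal.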


Get this paper in your agent:

hf papers read 2605.09932
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2605.09932 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2605.09932 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2605.09932 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.

Discussion (0)

