Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes
Authors: Egor Shvetsov, Aleksandr Serkov, Shokorov Viacheslav, Redko Dmitry, Vladislav Goloshchapov, Evgeny Burnaev
Published: 2026-05-17
AI-generated summary
Standard losses interacting with positively biased activation functions cause negative weight drift during early training, leading to significant activation sparsity and affecting model accuracy across various architectures.
Abstract
The design of modern neural architectures has converged through incremental empirical choices, yet the mechanisms governing their training dynamics remain only partially understood. We identify and analyze a negative weight drift induced by the interaction between standard losses and positively biased activation functions. We prove that under MSE or cross-entropy loss, the gradient with respect to positive pre-activations is non-negative in expectation at initialization, driving downstream weights toward negative values during early training. The drift is intrinsic to optimization rather than data, and persists across architectures (MLP, ResNet, ViT, GPT-nano, MP-SENe) and asymmetric activation functions (ReLU, GELU, SiLU). Coupled with ReLU, weight drift produces activation sparsity reaching up to 90% in GPT-nano. We characterize the sparsity-accuracy tradeoff across 79 configurations and identify a sharp accuracy cliff above ~70% activation sparsity. While ReLU^2 achieves a good sparsity-accuracy ratio in GPT-nano, it pathologically amplifies identified activation spikes in intermediate transformer layers. Clipping resolves this while preserving the representational benefits of squaring: clipped ReLU^2 outperforms its unclipped version, and GELU^2 achieves the lowest validation loss on GPT-nano. Code is available at https://github.com/On-Point-RND/BugOrFeature.
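The sign claim for MSE can be checked numerically in a toy setting. The sketch below is not the paper's code: it assumes a single hidden ReLU layer z = Wx, a scalar output y = v·a, He-style initialization, and Gaussian noise for both inputs and targets. The function name and all sizes are hypothetical. It averages dL/dz over active units (z > 0) across several random initializations.

```python
import random


def relu(z):
    return z if z > 0.0 else 0.0


def mean_grad_on_active_units(width=16, dim=8, batch=64, n_inits=20, seed=0):
    """Average dL/dz over active (z > 0) hidden units at random initialization.

    Toy model (an assumption, not the paper's setup):
        z = W x,  a = relu(z),  y = v . a,  L = 0.5 * (y - t)^2
    Inputs x and targets t are pure Gaussian noise, so no data structure is involved.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_inits):
        # He-style initialization; the scale is an illustrative choice.
        W = [[rng.gauss(0.0, (2.0 / dim) ** 0.5) for _ in range(dim)]
             for _ in range(width)]
        v = [rng.gauss(0.0, (2.0 / width) ** 0.5) for _ in range(width)]
        for _ in range(batch):
            x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            t = rng.gauss(0.0, 1.0)
            z = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
            y = sum(v_i * relu(z_i) for v_i, z_i in zip(v, z))
            err = y - t  # dL/dy for the MSE loss above
            for v_i, z_i in zip(v, z):
                if z_i > 0.0:
                    # dL/dz_i = (y - t) * v_i * relu'(z_i), with relu'(z_i) = 1 here
                    total += err * v_i
                    count += 1
    return total / count


if __name__ == "__main__":
    print(mean_grad_on_active_units())
```

In this toy model the average comes out positive because y itself contains the term v_i * a_i with a_i > 0, so (y - t) * v_i has expectation E[v_i^2] * a_i over the random draw of v. A positive gradient on pre-activations that are fed by non-negative inputs translates into a positive gradient on the corresponding weights, which gradient descent then pushes downward.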
Community
Every time you train a network with ReLU, GELU, or SiLU, your weights quietly drift negative. Not because of your data: it happens on random inputs too. It's baked into the math of gradient descent + asymmetric activations.
We prove it formally (MSE & cross-entropy) and show it across MLP, ResNet, ViT, GPT, and a speech model.
What does this drift do? Negative weights push pre-activations into negative regions, and with ReLU, up to 90% of activations end up being zero, zeroed out by the very same function that caused the drift in the first place! Bug or feature? Depends on how you use it.
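To see how a negative shift in the weights turns into ReLU sparsity, here is a rough sketch (not from the paper's repo). It assumes a single linear layer fed with non-negative inputs, standing in for the previous layer's ReLU outputs, and weights drawn from a Gaussian whose mean is shifted by hand. The function name, sizes, and shift values are all hypothetical.

```python
import random


def relu_sparsity(weight_mean, width=64, fan_in=64, batch=50, seed=0):
    """Fraction of ReLU outputs that are exactly zero for a given weight mean."""
    rng = random.Random(seed)
    std = (2.0 / fan_in) ** 0.5  # He-style scale, an illustrative choice
    W = [[rng.gauss(weight_mean, std) for _ in range(fan_in)]
         for _ in range(width)]
    zeros, total = 0, 0
    for _ in range(batch):
        # Non-negative inputs, mimicking activations coming out of an earlier ReLU.
        h = [max(0.0, rng.gauss(0.0, 1.0)) for _ in range(fan_in)]
        for row in W:
            z = sum(w * x for w, x in zip(row, h))
            if z <= 0.0:  # ReLU maps every non-positive pre-activation to zero
                zeros += 1
            total += 1
    return zeros / total


if __name__ == "__main__":
    for mean in (0.0, -0.02, -0.05):
        print(mean, relu_sparsity(mean))
```

With zero-mean weights about half of the units fire. Since the inputs are non-negative, even a small negative mean shifts every pre-activation down by an amount proportional to the sum of the inputs, so the zero fraction climbs quickly.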
The most interesting finding: ReLU² boosts GPT-nano performance, but it pathologically amplifies activation spikes by 25×. The fix is simple: clip it. Clipped ReLU² and GELU² both outperform their non-squared versions, with GELU² achieving the best validation loss overall on GPT-nano.
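For readers who want to try this, a minimal sketch of the squared activations with clipping follows. This is a guess at the shape of the idea, not the repo's implementation: the post does not say where the clip is applied or what threshold is used, so clipping the output after squaring and the value `cap=6.0` are both assumptions, as are the function names.

```python
import math


def relu_squared(x):
    """ReLU followed by squaring: max(0, x)^2."""
    r = max(0.0, x)
    return r * r


def clipped_relu_squared(x, cap=6.0):
    """ReLU^2 with the output capped at `cap` (threshold chosen arbitrarily)."""
    return min(relu_squared(x), cap)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_squared(x):
    """GELU followed by squaring."""
    g = gelu(x)
    return g * g


def clipped_gelu_squared(x, cap=6.0):
    """GELU^2 with the output capped at `cap` (same assumed clipping rule)."""
    return min(gelu_squared(x), cap)


if __name__ == "__main__":
    for x in (-2.0, 0.5, 3.0, 10.0):
        print(x, relu_squared(x), clipped_relu_squared(x), clipped_gelu_squared(x))
```

Squaring makes large pre-activations grow quadratically, which is one way a spike can be amplified. The cap bounds the output while leaving the small-value regime untouched.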
💻 Code: github.com/On-Point-RND/BugOrFeature
arXiv: arxiv.org/abs/2605.17659